Historic,  Archive  Document 

Do  not  assume  content  reflects  current 
scientific  knowledge,  policies,  or  practices. 




UNITED  STATES  DEPARTMENT  OF  AGRICULTURE
Bureau  of  Agricultural  Economics 

CORRELATION  THEORY  AND  METHOD 
APPLIED  TO  AGRICULTURAL  RESEARCH 

By 

Bradford  B.  Smith 


Washington,  D.  C. 
August  1926 


Table of Contents

Introduction

I.    The Field of Correlation

II.   Gross Correlation
      1  Regression
      2  Correlation
      3  Summary
      4  Arithmetic Methods

III.  Correlation Ratio

IV.   Correlation Index

V.    Multiple Linear Correlation
      1  Regression
      2  Correlation
      3  Determination, part and partial correlation
      4  Arithmetic methods
      5  Use of Multiple Correlation Methods in Fitting Parabolae

VI.   Multiple Curvilinear Correlation

VII.  Joint Relationships

VIII. Application to Time Series


Note: The use of a typewriter with standard keyboard for this
manuscript has made necessary in a few cases the substitution of other
characters than those commonly used. The summation sign (Greek
letter Sigma) is usually expressed by (S). The symbol for the
correlation ratio (Greek letter Eta) is expressed by (n). The addition
sign is in many cases expressed by the symbol (+). The sign for
the standard deviation is expressed by the symbol (o). No confusion
should result from the use of these symbols, since they are used in the
manuscript for no other purpose.

In the description of multiple correlation, the symbol representing
a net regression coefficient is underlined ( b ) to distinguish it from
the symbol representing one of the independent variables ( b ).


CORRELATION  THEORY  AND  METHOD
APPLIED  TO  AGRICULTURAL  RESEARCH 


Collected  and  prepared  for  the  use  of 
Statisticians  of  the  Bureau  of  Agricultural  Economics 

by 

Bradford  B.  Smith,  Economic  Analyst,
Division  of  Statistical  and  Historical  Research. 

Introduction 

In no one volume are the theory and methods of correlation as applied
to agricultural research in the Bureau of Agricultural Economics brought
together in a form readily adapted to reference purposes. On the other hand
not a few new statistical methods have been developed by members of the
staff. This publication is an attempt to bring together and coordinate such
methods. In order to be complete it presents usual correlation theory, but
from a point of view easily adaptable to include the more recent theory and
method. The approach is essentially that which has been given in the
Graduate School of the Department of Agriculture for the past two years and is
very similar in its treatment of simple correlation to that found in Prof.
Frederick C. Mills' excellent text, "Statistical Methods Applied to Economics
and Business". The new subject matter on multiple linear and curvilinear
correlation, joint relationships, application to time series, and apportionment
of importance to contributing variables is founded partly on articles
published by members of the staff and partly on material as yet unpublished.
Statisticians are especially indebted to H. R. Tolley and Mordecai Ezekiel
of this Bureau for the notable contributions they have made to correlation
methods, designated later in this work. Appreciation is also extended to
E. M. Daggit of this Bureau for assistance in preparing the manuscript.


I    THE  FIELD  OF  CORRELATION

The  biologist  studying  effects  of  changing  environmental 
factors  on  inheritance  finds  that  one  experiment  does  not  always 
precisely  verify  the  results  of  another;  for  influences  other  than 
the  specific  one  under  observation  are  at  work,  and  change  the  ap- 
parent effect  from  time  to  time. 

The  entomologist,  studying  the  effectiveness  of  practical 
methods  of  checking  the  inroads  of  pests,  frequently  has  the  true 
effect  obscured  by  the  interference  of  unmeasured  factors.  This 
prevents  his  establishing  the  true  quantitative  relations. 

The economist or student in the social sciences is
especially at a disadvantage in either detecting or demonstrating
the social laws, for it is practically impossible to secure
experimental conditions when dealing with mass human reactions:
conditions where the influence of all factors but the ones under
consideration are held constant throughout the recording of the data. There
may for example be a perfectly simple theoretical relation between
the price, p, and the supply, s, of a commodity, such as

$p = \frac{1}{a + b\,s}$

But in practice this is difficult to demonstrate, for other factors
influencing the price, such as changing demand, quality of the
product and the value of gold, through their changes obscure the
relation. Thus that which appears to be the quantitative relation
in one case is modified in the second, the third and so on. What
then is the true relation? Correlation method is but a tool for
learning from such variable data, taken as a whole, what the most
probable relation is. This method has its foundation in the
theory of probability. In addition to giving a means of
ascertaining what the most probable relation is, it gives an idea as
to how closely this most probable relation is fulfilled.

Correlation methods are evidently not needed for experiments
in the natural sciences such as chemistry and physics where
by exact standardization under experimental conditions the effect
of a given cause can be demonstrated precisely time after time.
In those undertakings, however, where we are unable to ascertain
the precise relation existing between one factor and another since
other factors exert a variable influence, it becomes necessary to
draw out from the mass of often conflicting data some conclusion
which we may say represents on the whole the most probable relation.
The correlation methods are the most effective tools yet devised
for dealing with such situations. They owe their origination
largely to the English School of Biometricians, pre-eminently Karl Pearson.
Recently there have been a number of texts applying these methods
to the problems of the social sciences as distinguished from the
biological. Each writer has his own particular approach to the
problem. The approach presented here is based on the belief that
simple correlation, the correlation of two variables, is but
the beginning and the least effective of all the correlation
methods for handling economic agricultural problems. Therefore
any treatment of simple correlation should be such as is easily
and logically expansible to include multiple and partial
correlation, as well as curvilinear correlation. Nearly every important
economic research study dealing with quantitative aspects is
appreciably more satisfactory when such methods are used. This is
because nearly every such study is inevitably confronted with the
necessity for taking into account the influence of several related
factors in detecting, demonstrating or applying economic law.
The only known method of doing this is the multiple correlation
method. This should not be interpreted to mean that the
correlation methods are by any means sufficient. They have definite
and serious limitations; so much so, that it is almost safe to
say that they are misused more often than not. However, the methods
are improving continually with each contribution.


II    GROSS  CORRELATION 


1.  Regression. 

Suppose that a series, X, is associated with a series, Y, so
that each value of X has a corresponding value of Y. These may,
for example, represent the price and the quantity of melons sold
on different days in a central market; n may represent the number
of pairs of observations.

The problem of studying the relation between the two series
may be divided into two parts: (1) to measure how great a
divergence in X (price) from its average is associated with a unit
divergence in Y (quantity); and (2) to obtain some idea as to
how closely the relation is fulfilled.

Deviations from the averages are taken because frequently a
direct proportion may be discovered between such deviations which
is not apparent between the original values. The two following
series (X and Y) serve to illustrate:


      X       X-av.       Y       Y-av.

     10        -4         8        +4
     12        -2         6        +2
     14         0         4         0
     16        +2         2        -2
     18        +4         0        -4

Evidently the relation between deviations in x and in y
may be expressed by the factor, -1.0. We may say:

$X - M_x = -1\,(Y - M_y)$    (1)

where M stands for the respective arithmetic averages. The
relation between the original values cannot be expressed as a simple
proportion in this way. On the other hand if there should be a direct
relation between X and Y as shown in the following table, where Y = 2X,
the deviations from average will show the relation quite as
successfully as if the ratio of the original items was taken:


      X       X-Mx        Y       Y-My

      2        -4         4        -8
      4        -2         8        -4
      6         0        12         0
      8        +2        16        +4
     10        +4        20        +8
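
The point of the two tables can be checked with a short Python sketch. The function names are illustrative assumptions, not part of the original text; the data are the two small series tabulated above.

```python
def deviations(series):
    """Return each item's deviation from the arithmetic mean of the series."""
    mean = sum(series) / len(series)
    return [value - mean for value in series]

def deviation_ratios(X, Y):
    """Ratio x/y of paired deviations, skipping pairs where y is zero."""
    return [x / y for x, y in zip(deviations(X), deviations(Y)) if y != 0]

# First table: X falls as Y rises.
X1, Y1 = [10, 12, 14, 16, 18], [8, 6, 4, 2, 0]
# Second table: the original items are in direct proportion.
X2, Y2 = [2, 4, 6, 8, 10], [4, 8, 12, 16, 20]

ratios_first = deviation_ratios(X1, Y1)     # constant ratio of deviations
ratios_second = deviation_ratios(X2, Y2)    # constant ratio of deviations
# Ratios of the original items in the first table are not constant.
original_ratios_first = [X / Y for X, Y in zip(X1, Y1) if Y != 0]
```

In the first table only the deviations show a constant ratio; in the second, both the deviations and the original items do, so nothing is lost by working with deviations.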

Since there is something to be gained and nothing to be
lost, therefore, the attempt should be to establish the proportional
relation between deviations rather than between original items.
The first problem can now be stated more specifically: we wish to
discover the value of b in expressions of the type:

$X - M_x = b\,(Y - M_y)$    (2)

or, letting x and y represent the deviations from average of X and
Y respectively:

$x = b\,y$    (3)

which may also be written, of course,

$X = M_x - b\,M_y + b\,Y$    (4)

or, since $M_x - b\,M_y$ will be a constant, K, more briefly

$X = K + b\,Y$    (5)

which is the familiar expression of the formula for a straight line
on graph paper of coordinates, X and Y.

Since X is determined by Y, in this set-up X may be
termed the dependent variable and Y the independent. If

$Y = K + b\,X$    (6)

were written, Y would be the dependent and X the independent.

Since  the  first  task  cited  above  is  to  find  how  great  a 
divergence  from  average  in  X  is  associated  with  a  unit  divergence 
in  Y,  a  value  of  b  in  formula  (3)  should  be  found  so  that  the 
equation  would  be  true  for  every  pair  of  observations.     The  capital 
letters  X  and  Y  are  taken  to  represent  original  values,  while  the 
small  letters  x  and  y  represent  deviations. 

Taking numerical data as shown in table 1, and considering
the first case, it is possible to solve for a value of b, for if x =
b.y, then since x = 0 and y = -1, b must be equal to 0.

In  like  manner  b  is  found  to  equal  0  in  the  second  case;  but 
in  the  third  observation  b  is  found  to  equal  -.5;  in  the  fourth, 
infinity. 

Evidently there is no one value of b which will satisfy
perfectly for all the observations the formula

$x = b\,y$

and it is only in the most unusual circumstances that series will be
found where a single value of b will satisfy perfectly for all
observations the requirement, x = b.y.

Logically, therefore, the value of b which will come the
nearest to satisfying all the observations, the "most probable
value" of b, should be found, and described as the probable relation
existing between the deviations in the two series.


In the application of the laws of probability a general
procedure for finding such values has been evolved. This is the
method of least squares. 1/ It assumes that the criterion of
"best value" or "best fit" is that the sum of the squared residuals
shall be a minimum. The meaning of this can be readily understood
if a dot chart or Galton graph such as shown in Figure 1 is made,
where the ordinates are the values of x and the abscissae the values
of y. Each dot, when properly located, represents an x and a y
value by the distances from the x and y axes.

Because x = b.y (see formula 3), the value of x will be zero
when the value of y is zero; hence any line drawn to represent the
relation of x to y must pass through the origin, where x=0 and y=0.
A line drawn diagonally from the upper left corner
to the lower right will obviously come nearer passing through
most of the dots representing data from Table 1 than will a
horizontal line. Should such a line be drawn on the graph and the
positive and negative vertical distances between each dot and the
line measured, the measurements found would represent the "residuals."
The value of b defines the slope of the line, for b, according to the
formula, x = b.y, is nothing but the distance covered in the x
direction when there is a unit change in the y direction. This is true no
matter what portion of the line is considered, so the line must
necessarily be a straight line. Relationships producing such lines
are therefore described as linear, to distinguish them from
relationships producing a curved line, which are described as curvilinear.
In the above example, b must be a negative value, since whenever y
increases in value, x decreases. The criterion of "best fit" (that
is, that the sum of the residuals squared must be a minimum) means
merely that the value, b, or the slope of the line on the chart
must be such that the sum of the squared vertical distances
between the dots and the line will be a minimum. This then is the
"most probable value" of b. The finding of b will be the
accomplishment of the first portion of the problem.

1/ Merriman, Mansfield. Elements of the Method of Least Squares.
London, 1877.

In order to secure the values of the differences, or
residuals, one must know the ordinate values of the points on the line
to subtract from the known values of x. But for any given y
value this value will be b.y, which may be termed x', since it will
be the value of x secured if it were estimated from its described
relation to y as shown by the line. Letting z represent any
residual, therefore, z = x-x' or

$z = x - b\,y$    (7)

And the best value of b in the relation, x = b.y, will be such
that $Sz^2$ will be a minimum. 2/ The value, b, in correlation
terminology is the "regression coefficient" or the "regression of X on Y."
It is the amount of change in X associated with a unit change in Y.
It may be distinguished from other regression coefficients by
subscripts, the initial subscript indicating the dependent. The
subscripts will change in order according to whether the regression
of X on Y, or the regression of Y on X, is being described. Thus

$x = b_{xy}\,y$  and

$y = b_{yx}\,x$    (8)

2/ The designation "S" before a term may be taken to mean the sum of
the terms like that before which it stands. It is frequently expressed
by the Greek letter Sigma.

To find the value of $b_{xy}$ by the method of least squares, multiply
each observation equation through by the coefficient of $b_{xy}$ in that
equation (subscripts to x and y designate the observation number):

   Observation               Coeff. of b        Extension

   $x_1 = b_{xy}\,y_1$       $y_1$              $x_1 y_1 = b_{xy}\,y_1^2$
   $x_2 = b_{xy}\,y_2$       $y_2$              $x_2 y_2 = b_{xy}\,y_2^2$
   ...                       ...                ...
   $x_n = b_{xy}\,y_n$       $y_n$              $x_n y_n = b_{xy}\,y_n^2$

                                                $Sxy = b_{xy}\,Sy^2$

Summing the product equations so secured gives

$Sxy = b_{xy}\,Sy^2$    (9a)


The values Sxy and $Sy^2$ may be conveniently secured by some such
arrangement as shown in table 1. Equation (9a) or, for the example,
$-244 = b_{xy} \cdot 278$, is called a "normal equation" and its solution
gives the value of

$b_{xy} = \frac{Sxy}{Sy^2} = \frac{-244}{278} = -.878$

The value, Sxy/n or $p_{xy}$, is termed the product moment. Thus
$b_{xy} = p_{xy}/\sigma_y^2$, which is perhaps a more convenient way of
defining $b_{xy}$.

Having found the value of $b_{xy}$ it can immediately be inserted in
the equation and the equation written,

$x' = b_{xy}\,y$    (9)

or

$x' = -.878\,y$
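
The solution of the normal equation can be sketched in Python. The paired deviations below are hypothetical, since Table 1 is not reproduced here, and the function names are illustrative assumptions; only the totals -244 and 278 are taken from the text.

```python
def regression_of_x_on_y(x, y):
    """Least-squares b_xy = Sxy / Sy^2 for paired deviations x and y."""
    s_xy = sum(xi * yi for xi, yi in zip(x, y))
    s_yy = sum(yi * yi for yi in y)
    return s_xy / s_yy

def sum_squared_residuals(b, x, y):
    """Sz^2, where each residual is z = x - b*y."""
    return sum((xi - b * yi) ** 2 for xi, yi in zip(x, y))

# Hypothetical deviations from average (each list sums to zero).
x = [3, 1, 0, -1, -3]
y = [-2, -2, 1, 0, 3]

b_xy = regression_of_x_on_y(x, y)
best = sum_squared_residuals(b_xy, x, y)
# Any other slope gives a larger sum of squared residuals.
worse_low = sum_squared_residuals(b_xy - 0.1, x, y)
worse_high = sum_squared_residuals(b_xy + 0.1, x, y)

# The totals quoted in the text reproduce the slope of the example.
b_from_text = round(-244 / 278, 3)
```

Moving the slope in either direction away from Sxy/Sy^2 increases the sum of squared residuals, which is what the least-squares criterion requires.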


Equation (9) is termed the "regression equation" in correlation
terminology. The values of x' and y may easily be expressed in terms
of original items instead of deviations from average by writing in the
equivalents of x and y.

$X' - M_{x'} = b_{xy}\,(Y - M_y)$    (10)

or    $X' = M_{x'} - b_{xy}\,M_y + b_{xy}\,Y$    (11)

$M_x$ may be substituted for $M_{x'}$, since Sx' = Sx = 0, thus leaving the
means of the X' and X series identical.
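
As an illustration of formula (11), the following Python sketch fits the regression of X on Y for the first small table of this section (X = 10 to 18, Y = 8 to 0) and writes the line in terms of original items. The function name is an illustrative assumption.

```python
def regression_in_original_units(X, Y):
    """Return (K, b_xy) so that the estimate is X' = K + b_xy * Y."""
    n = len(X)
    m_x, m_y = sum(X) / n, sum(Y) / n
    s_xy = sum((a - m_x) * (b - m_y) for a, b in zip(X, Y))
    s_yy = sum((b - m_y) ** 2 for b in Y)
    b_xy = s_xy / s_yy
    k = m_x - b_xy * m_y          # the constant M_x - b_xy * M_y
    return k, b_xy

X = [10, 12, 14, 16, 18]
Y = [8, 6, 4, 2, 0]
k, b_xy = regression_in_original_units(X, Y)
estimates = [k + b_xy * v for v in Y]   # X' for each observed Y
```

Because this series is perfectly linear, the estimates coincide with the observed X values; with real data they would differ by the residuals.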

The first object has now been accomplished: to measure how great
a divergence in the dependent is probably associated with a unit
divergence in the independent. A value of $b_{xy}$ has been secured which, if
applied to each value of y, will give products, x', that will most nearly
equal the associated values of x; the sum of the values, z=x-x', squared,
will be less than if any other value of $b_{xy}$ should be used.

As stated previously, the line on the graph, $x' = b_{xy}\,y$, must
pass through the origin, where both x and y are zero; and letting y
equal any other value, x' may be determined; the plotting of this point
then determines the line. In correlation terminology this line is called
the regression line.




2.  Correlation. 
The second problem, as stated in the beginning, is to secure
some idea as to how truly the described relation is fulfilled, or
as can be said now, how closely the x' values coincide with the x
values, or in another way, how nearly the x values come to being
on the regression line. An inspection of the graph, Figure 1,
gives some idea, and prompts the thought that a numerical measure
of the agreement might be obtained by averaging the residuals.
Accordingly:

Each  $z = x - x'$

$= x - b_{xy}\,y$

Therefore  $Sz = Sx - b_{xy}\,Sy$    (12)

But both Sx and Sy are zero, hence Sz is zero, which means that only
by disregarding the signs of z could such a numerical measure be
secured. This, however, would prohibit any further rigid
mathematical treatment of the measure. So, just as the signs are eliminated
in securing standard deviation by squaring the differences, so the
signs of the residuals may all be made positive by squaring them.
In other words, the standard deviation of the residuals may be
taken as an inverse measure of the closeness of fit of the line.
This standard deviation, $\sigma_z$, is evidently $\sqrt{Sz^2/n}$ since the mean of
the series is zero. In correlation terminology it is called the
"standard error of estimate," and is sometimes symbolized, $s_{xy}$.


But taking the standard deviation of the residuals is not
enough, for this value can change according to the value of the scale
or unit used. For comparative purposes a measurement more on the order
of a coefficient that will not change with changes in units or scales
is needed. To do this $\sigma_z$ should be expressed in terms of some form of
the series, X, and since the standard deviation of z is being taken,
what would be more logical than to take the standard deviation of the
original series? Let the mathematical relation existing between them
accordingly be determined.

Each  $z = x - x'$

$z^2 = x^2 - 2xx' + x'^2$

$\frac{Sz^2}{n} = \frac{Sx^2}{n} - \frac{2\,Sxx'}{n} + \frac{Sx'^2}{n}$    (13)

But Sxx'/n, or the product moment, $p_{xx'}$, of x and the estimates of x
from the regression equation, may be shown to equal $Sx'^2/n$:

Each  $xx' = x \cdot b_{xy}\,y$

$= x \left(\frac{Sxy}{Sy^2}\right) y$

$Sxx' = \frac{Sxy \cdot Sxy}{Sy^2}$    (14)

Also, each  $x'^2 = b_{xy}^2\,y^2$

$= \frac{Sxy \cdot Sxy}{Sy^2 \cdot Sy^2}\,y^2$

$Sx'^2 = \frac{Sxy \cdot Sxy}{Sy^2}$    (15)

Cancelling in equation 13

$\frac{Sz^2}{n} = \frac{Sx^2}{n} - \frac{Sx'^2}{n}$    (16)

The mean of the x' values is zero, so $Sx'^2/n$ is the standard
deviation squared of the x' values, the estimates of x secured by
multiplying each y by $b_{xy}$, all of which values lie on the regression
line. Accordingly,

$\sigma_z^2 = \sigma_x^2 - \sigma_{x'}^2$    (17)

or    $\sigma_x^2 = \sigma_{x'}^2 + \sigma_z^2$    (18)
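
The relations (12) through (18) can be verified numerically. The Python sketch below uses hypothetical deviations (not the Table 1 data) and illustrative variable names; standard deviations use the divisor n, as in the text.

```python
import math

x = [3, 1, 0, -1, -3]     # hypothetical deviations of the dependent
y = [-2, -2, 1, 0, 3]     # hypothetical deviations of the independent
n = len(x)

b_xy = sum(a * b for a, b in zip(x, y)) / sum(b * b for b in y)
x_est = [b_xy * b for b in y]               # x' = b_xy * y
z = [a - e for a, e in zip(x, x_est)]       # residuals z = x - x'

sum_z = sum(z)                              # vanishes, as in (12)
s_x_est = sum(a * e for a, e in zip(x, x_est))   # Sxx'
s_est_sq = sum(e * e for e in x_est)             # Sx'^2, equal to Sxx'

var_x = sum(a * a for a in x) / n           # sigma_x squared
var_est = s_est_sq / n                      # sigma_x' squared
var_z = sum(r * r for r in z) / n           # sigma_z squared
std_error_of_estimate = math.sqrt(var_z)    # sigma_z
```

The variance of x splits exactly into the variance of the estimates plus the variance of the residuals, which is relation (18).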

These are very important formulae. The meaning of the
relations shown may be stated thus: The proportion of the total squared
variability in the dependent that is due to y as expressed by x'
values may be expressed by the fraction $\sigma_{x'}^2/\sigma_x^2$, and the proportion
of the total squared variability in x not explained by y may be
expressed by the fraction $\sigma_z^2/\sigma_x^2$. The expression $\sigma_{x'}^2/\sigma_x^2$ may be termed
a coefficient of determination, $d_{xy}$, and $\sigma_z^2/\sigma_x^2$ (which may be written
$d_{xz}$) is the proportion of unaccounted-for squared variability. The sum
$d_{xy} + d_{xz}$ must always equal unity, and since the two represent
derivations from sums of squares they are always positive in sign.

If we take the square roots of these measures, getting them
back to coefficients of the first order, we have $\sigma_{x'}/\sigma_x$ and $\sigma_z/\sigma_x$.
The expression $\sigma_z/\sigma_x$ is the ratio of the standard deviation of the
residuals to the standard deviation of the original x series. Whenever
this value becomes large it means that there is very little
relation between the x series and the y series, and when small it
means that the standard deviation of residuals compared to the
original standard deviation of x is small, and hence that the values
of $b_{xy}\,y$ approximate closely to the original values of x. Since it
varies inversely with the closeness of the relation between x and y
it is termed the coefficient of alienation. The coefficient $\sigma_{x'}/\sigma_x$
or $r_{xy}$, since it varies directly with the closeness of the relation
between x and y, is termed the Pearsonian Coefficient of Correlation,
after its originator.
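
A minimal Python sketch of the three coefficients follows, assuming hypothetical deviations and an illustrative function name. The symbol k for the coefficient of alienation is an assumption made for the example; the text itself gives it no letter here.

```python
import math

def gross_coefficients(x, y):
    """Determination, alienation and correlation for deviations x and y."""
    n = len(x)
    s_xy = sum(a * b for a, b in zip(x, y))
    b_xy = s_xy / sum(b * b for b in y)
    var_x = sum(a * a for a in x) / n
    var_est = sum((b_xy * b) ** 2 for b in y) / n
    var_z = max(var_x - var_est, 0.0)       # guard against rounding below zero
    d_xy = var_est / var_x                  # coefficient of determination
    k = math.sqrt(var_z / var_x)            # coefficient of alienation
    # r takes the sign of the regression coefficient.
    r_xy = math.copysign(math.sqrt(d_xy), b_xy)
    return d_xy, k, r_xy

d_xy, k, r_xy = gross_coefficients([3, 1, 0, -1, -3], [-2, -2, 1, 0, 3])
```

For these hypothetical series the determination and the squared alienation sum to unity, and r is negative because x falls as y rises.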

If the value of $b_{xy}$ is negative the coefficient of correlation
is said to be negative and is always preceded by a negative sign. Its
meaning in that event is that the dependent becomes smaller in value
as the independent becomes larger, as is the case in the example, Table
1, and Figure 1.

From the formula, derived from 18,

$\frac{\sigma_{x'}^2}{\sigma_x^2} + \frac{\sigma_z^2}{\sigma_x^2} = 1$    (19)

it is apparent that neither coefficient can ever be greater than plus
or minus 1.00. If $\sigma_{x'}/\sigma_x$ is either plus or minus 1.00 (and $\sigma_z/\sigma_x$
therefore zero), it means that a perfect interpretation of x can be
made from y, i.e., that all observed values of x, when plotted, lie on
the regression line, $x' = b_{xy}\,y$; conversely, if $\sigma_{x'}/\sigma_x$ is zero it means
that the standard deviation of the estimates is zero, therefore that
the regression line is horizontal, and furthermore that the value of
$b_{xy}$ is zero. It means, thus, either (1) that there is no relation
between x and y, or else (2) if there is any relation, it can not be
expressed by a straight line. Many students are prone to explain low
values of r by the first reason, whereas there may be a very high degree
of relation shown when the proper curve is substituted for the straight
line. It is customary, however, to speak of two variables as uncorrelated
when r is very small; but it must not be forgotten that the second
meaning given above may also be the explanation of low values of r.
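
The second meaning of a low r can be shown with a small constructed case. In the Python sketch below, X is fixed exactly by Y through a parabola (a hypothetical series chosen for illustration), yet the straight-line regression is horizontal and r is zero.

```python
import math

Y = [-2, -1, 0, 1, 2]
X = [v * v for v in Y]          # X is determined exactly by Y (a parabola)

n = len(X)
m_x, m_y = sum(X) / n, sum(Y) / n
x = [a - m_x for a in X]        # deviations from average
y = [b - m_y for b in Y]

b_xy = sum(a * b for a, b in zip(x, y)) / sum(b * b for b in y)
sigma_x = math.sqrt(sum(a * a for a in x) / n)
sigma_est = math.sqrt(sum((b_xy * b) ** 2 for b in y) / n)
r = math.copysign(sigma_est / sigma_x, b_xy)
```

Here the relation is perfect but curvilinear, so a zero r reflects the unsuitability of the straight line rather than the absence of any relation.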


The formula for the usual method of computing (and defining)
r may be derived as follows:

From (15),  $\sigma_{x'}^2 = \frac{p_{xy}^2}{\sigma_y^2}$

$\sigma_{x'} = \sqrt{\frac{p_{xy}^2}{\sigma_y^2}} = \frac{p_{xy}}{\sigma_y}$    (22)

Dividing by $\sigma_x$ to secure r,

$r_{xy} = \frac{\sigma_{x'}}{\sigma_x} = \frac{p_{xy}}{\sigma_x \sigma_y}$  or  $\frac{Sxy}{n\,\sigma_x \sigma_y}$    (23)

which is the usual definition of the Pearsonian Coefficient of
Correlation.
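
A short Python sketch of the product-moment formula follows, compared with the ratio of the standard deviation of the estimates to that of x. The deviations are hypothetical and the function name is an illustrative assumption.

```python
import math

def pearson_r(x, y):
    """r_xy = Sxy / (n * sigma_x * sigma_y), for deviations x and y."""
    n = len(x)
    sigma_x = math.sqrt(sum(a * a for a in x) / n)
    sigma_y = math.sqrt(sum(b * b for b in y) / n)
    return sum(a * b for a, b in zip(x, y)) / (n * sigma_x * sigma_y)

x = [3, 1, 0, -1, -3]     # hypothetical deviations
y = [-2, -2, 1, 0, 3]
r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)    # the order of the subscripts does not matter

# The same magnitude reached through the estimates x' = b_xy * y.
n = len(x)
b_xy = sum(a * b for a, b in zip(x, y)) / sum(b * b for b in y)
sigma_est = math.sqrt(sum((b_xy * b) ** 2 for b in y) / n)
sigma_x = math.sqrt(sum(a * a for a in x) / n)
ratio = sigma_est / sigma_x
```

The product-moment form carries the sign directly, whereas the ratio of standard deviations gives only the magnitude.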

If y is made the dependent and the value of $b_{yx}$ (note change
in order of subscripts) solved for in

$y = b_{yx}\,x$  then

$r_{yx} = \frac{\sigma_{y'}}{\sigma_y}$    (24)

This value, $r_{yx}$, may also be shown to equal $p_{xy}/\sigma_x \sigma_y$
and hence is equal to $r_{xy}$. The order of writing the subscripts is
therefore unimportant. This is not the case with the regression
coefficients, however, since they represent two different lines on the
graph. $b_{yx}$ is graphed on Figure 1 as a broken line.

Since $b_{xy} = p_{xy}/\sigma_y^2$, it takes its sign from the sign of $p_{xy}$,
as will the coefficient of correlation when computed from formula 23.
Both regressions are evidently of the same algebraic sign with the
coefficient of correlation.


The coefficient $b_{xy}$ may be expressed in terms of r, $\sigma_x$ and $\sigma_y$
as follows:

$b_{xy} = \frac{p_{xy}}{\sigma_y^2} = \frac{r\,\sigma_x \sigma_y}{\sigma_y^2} = r\,\frac{\sigma_x}{\sigma_y}$    (25)

from which the regression may be written in the form:

$x = r\,\frac{\sigma_x}{\sigma_y}\,y$    (26)

or

$\frac{x}{\sigma_x} = r\,\frac{y}{\sigma_y}$    (27)

Similarly the other regression equation (8) may be written

$\frac{y}{\sigma_y} = r\,\frac{x}{\sigma_x}$    (28)

The inference to be drawn from these formulae gives an additional
meaning to the conception of the coefficient of correlation, i.e.,
when the variations in the series are reduced to comparable
denominators, their standard deviations, r expresses completely the relation
between them: both the amount of change in each associated with unit
changes in the other and also the degree to which this relation may be
expected to be maintained.

An interesting relation useful as a check on computations may
be shown to exist between r, $b_{xy}$ and $b_{yx}$:

As shown above (25)

$b_{xy} = r\,\frac{\sigma_x}{\sigma_y}$,  and  $b_{yx} = r\,\frac{\sigma_y}{\sigma_x}$

Multiplying the two equations together gives

$b_{xy}\,b_{yx} = r^2$

3.  Summary of formulae

I        $\sigma_x^2 = \sigma_{x'}^2 + \sigma_z^2$
II       $\sigma_{x'}^2/\sigma_x^2 + \sigma_z^2/\sigma_x^2 = 1$
III      $d_{xz} = \sigma_z^2/\sigma_x^2$
IV       $d_{xy} = \sigma_{x'}^2/\sigma_x^2$
V        $d_{xz} + d_{xy} = 1$
VI       $r_{xy} = r_{yx}$
VII to IX      [illegible]
X        $r_{xy} = \sigma_{x'}/\sigma_x$
XI       $r_{yx} = \sigma_{y'}/\sigma_y$
XII      $r_{xy} = p_{xy}/(\sigma_x \sigma_y)$
XIII     $b_{xy} = p_{xy}/\sigma_y^2$
XIV      $b_{xy} = r\,\sigma_x/\sigma_y$
XV to XVI      [illegible]
XVII     $x' = b_{xy}\,y$
XVIII    $y' = b_{yx}\,x$
XIX      $X' = M_x - M_y\,b_{xy} + b_{xy}\,Y$
XX to XXIII    [illegible]
XXIV     $\sigma_z = \sigma_x \sqrt{1 - r_{xy}^2}$
XXV to XXX     [illegible]
XXXI     $r_{zy} = 0$
XXXII to XXXIII    [illegible]
XXXIV    $r_{x'y} = 1.00$


Note: It is useful to observe that adding a constant to
each item in a series in no wise affects the correlation,
for since every item is changed similarly the deviations
remain unchanged. Multiplying by a constant does not
change the correlation, for the numerator and denominator in
$r = p/\sigma_x \sigma_y$ are changed proportionally. The regressions
change, however.
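
The note can be illustrated with a brief Python sketch. The series are hypothetical and the function name is an illustrative assumption; the sketch shows the regression changing under multiplication by a constant.

```python
import math

def r_and_b(X, Y):
    """Return (r_xy, b_xy) computed from original items X and Y."""
    n = len(X)
    m_x, m_y = sum(X) / n, sum(Y) / n
    s_xy = sum((a - m_x) * (b - m_y) for a, b in zip(X, Y))
    s_xx = sum((a - m_x) ** 2 for a in X)
    s_yy = sum((b - m_y) ** 2 for b in Y)
    return s_xy / math.sqrt(s_xx * s_yy), s_xy / s_yy

X = [10, 12, 15, 11, 17]      # hypothetical series
Y = [8, 5, 4, 6, 2]

r0, b0 = r_and_b(X, Y)
r_shift, b_shift = r_and_b([a + 100 for a in X], Y)   # constant added
r_scale, b_scale = r_and_b([a * 10 for a in X], Y)    # constant multiplier
```

Adding 100 to every X leaves both r and the regression of X on Y unchanged, while multiplying X by 10 leaves r unchanged and makes the regression ten times as large.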

Coefficients of regression, alienation, correlation and
determination have been secured, which provide fairly complete tools with which
to meet the two problems outlined at the beginning of this discussion.
This, however, is all based on the assumption that relationships are
linear. In the event that such be not the case, it leads to the need
for modification of these measures. The correlation ratio, and the
correlation index, are such modifications. Their formulae and meaning
will subsequently be developed along substantially the same lines as
were those of the measures described in this section.

4.  Arithmetic Methods.

Some formulae useful in reducing labor of calculating the various
coefficients discussed in the previous sections may be derived.

An inspection of the formulae for all of these coefficients
shows that they may be computed from three values, $\sigma_x$, $\sigma_y$ and $p_{xy}$. The
computations involved, once these values are secured, are but slight.
The tedious part of computing the coefficients is in securing these
three values. In treating the original series the first step involved was
to secure the deviations, x and y, of each X and Y value from their
respective means; and since the differences so secured are more apt than
not to be fractions, the arithmetic of squaring and multiplying them
together is extended. Formulae for securing the values of Sxy/n, $\sigma_x$
and $\sigma_y$ may be easily developed which eliminate the necessity of using these
deviations. They represent a marked saving of labor, especially for
those who have access to any of the standard computing machines.


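The standard deviation of X may be expressed as follows:

$\sigma_x^2 = \frac{1}{n}\,Sx^2$

$= \frac{1}{n}\,S(X - M_x)^2$

$= \frac{1}{n}\,(SX^2 - 2\,M_x\,SX + n\,M_x^2)$

$= \frac{SX^2}{n} - 2\,\frac{SX}{n}\,\frac{SX}{n} + \frac{SX}{n}\,\frac{SX}{n}$

$= \frac{SX^2}{n} - M_x^2$

$\sigma_x = \sqrt{\frac{SX^2}{n} - M_x^2}$    (29)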

The use of this formula obviates the need of taking deviations
in securing the standard deviation of X. Substituting Y for X in the
formula, of course, gives an expression for the standard deviation
of Y.

Any  whole  number  approximating  the  mean  may  be  used  as  an 
assumed  mean  from  which  to  secure  deviations  and  a  subsequent  cor- 
rection will  give  the  true  standard  deviation.    This  method  means 
handling  smaller  values  than  used  in  formula  29  and  yet  eliminates 
the  fractional  difference. 

Let $M_x + e = \bar{M}_x$, the assumed mean, differing from the true,
$M_x$, by the amount e.

Then each deviation, $\bar{x}$, secured from the assumed mean is
described

$$\bar{x} = X - \bar{M}_x$$

and hence, each $\bar{x} = x - e$, where x is the true deviation,
$X - M_x$.

Then each

$$\bar{x}^2 = x^2 - 2xe + e^2$$

and

$$S\bar{x}^2 = Sx^2 - 2e\,Sx + ne^2$$

But Sx is by definition equal to zero.

Therefore

$$S\bar{x}^2 = Sx^2 + ne^2$$

and

$$\frac{S\bar{x}^2}{n} = \sigma_x^2 + e^2$$

or

$$\sigma_x^2 = \frac{S\bar{x}^2}{n} - e^2 \tag{30}$$

That is, the true standard deviation squared is secured by subtracting
from the mean square of the differences from the assumed average the square
of the difference between the true and assumed averages. It might be noted
that formula (29) represents a special case of formula (30), for in
the former the assumed mean is zero and hence the correction to be made is
the full value of the mean squared. It is helpful to note, also, that the
correction is always a subtraction whether the value of e is plus or minus,
since the squaring takes out the negative signs. A convenient assumed
mean is one which is a multiple of ten and of approximately the value of
the smallest item in the series. It is very easy to take the differences
then, and there are practically no negative quantities to confuse the
computation of the product moment.
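
The agreement of formulas (29) and (30) with the direct method of deviations may be checked numerically. The following is a minimal sketch in Python; the five values of X and the assumed mean of 10 are hypothetical figures chosen only for illustration, and the function names are not from the text.

```python
from math import sqrt

# Illustrative series (hypothetical values, not taken from the text).
X = [12, 15, 11, 18, 14]


def sd_from_deviations(values):
    """Standard deviation from deviations about the true mean."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / n)


def sd_from_raw_sums(values):
    """Formula (29): sqrt(SX^2/n - Mx^2); no deviations are taken."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum(v * v for v in values) / n - mean ** 2)


def sd_from_assumed_mean(values, assumed_mean):
    """Formula (30): mean square about the assumed mean, less e squared."""
    n = len(values)
    e = assumed_mean - sum(values) / n  # assumed mean minus true mean
    mean_square = sum((v - assumed_mean) ** 2 for v in values) / n
    return sqrt(mean_square - e ** 2)


sd_direct = sd_from_deviations(X)
sd_raw = sd_from_raw_sums(X)
sd_assumed = sd_from_assumed_mean(X, 10)  # 10 is an arbitrary assumed mean

print(sd_direct, sd_raw, sd_assumed)
```

All three figures should agree apart from rounding in the last decimal places; any other assumed mean may be substituted for 10 with the same result.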

Just  as  the  standard  deviations  may  be  defined  in  terms  of  the 
original  values,  so  also  may  the  product  moment: 


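$$
\begin{aligned}
\text{Each } xy &= (X - M_x)(Y - M_y) \\
                &= XY - M_y X - M_x Y + M_x M_y
\end{aligned}
$$

$$
\begin{aligned}
\frac{Sxy}{n} &= \frac{SXY}{n} - M_y\,\frac{SX}{n} - M_x\,\frac{SY}{n} + \frac{nM_xM_y}{n} \\
              &= \frac{SXY}{n} - M_xM_y
\end{aligned}
\tag{31}
$$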

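Or if it is desired to use assumed means,

$$\bar{M}_x = M_x + e_x, \qquad \bar{M}_y = M_y + e_y$$

Then each

$$\bar{x} = x - e_x \qquad \text{and each} \qquad \bar{y} = y - e_y$$

Then each

$$\bar{x}\bar{y} = xy - x e_y - y e_x + e_x e_y$$

and

$$\frac{S\bar{x}\bar{y}}{n} = \frac{Sxy}{n} - e_y\,\frac{Sx}{n} - e_x\,\frac{Sy}{n} + e_x e_y$$

or, since Sx and Sy are zero,

$$p_{xy} = \frac{Sxy}{n} = \frac{S\bar{x}\bar{y}}{n} - e_x e_y \tag{32}$$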

It should be noted that the correction, $e_x e_y$, may be either positive
or negative; care should accordingly be given to its sign in determining
$p_{xy}$ by this method. The most convenient method of listing the
arithmetic involved in computation of $p_{xy}$, $\sigma_x$ and $\sigma_y$
by the above formulae is shown in the tabular form given below.


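Observation    :  X      :  Y      :  X²      :  Y²      :  XY
Number         :  or x̄   :  or ȳ   :  or x̄²   :  or ȳ²   :  or x̄ȳ
---------------:---------:---------:----------:----------:----------
1              :  X1     :  Y1     :  X1²     :  Y1²     :  X1Y1
2              :  X2     :  Y2     :  X2²     :  Y2²     :  X2Y2
3              :  .      :  .      :  .       :  .       :  .
.              :  .      :  .      :  .       :  .       :  .
---------------:---------:---------:----------:----------:----------
Sums           :  SX     :  SY     :  SX²     :  SY²     :  SXY
Means          :  Mx     :  My     :  SX²/n   :  SY²/n   :  SXY/n
Corrections    :         :         :  -Mx²    :  -My²    :  -MxMy
Sum to secure  :         :         :  σx²     :  σy²     :  pxy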

In the event that computing machines such as the Monroe are
used in which it is possible to cumulate the squares and products,
the values S X², S Y², and S XY may be secured without listing
them. Thus, put the value of X1 into the keyboard and extend the
product, X1², into the register. Without clearing the register extend
X2² and so on. S X² may be read from the register after the last
extension has been made. If, when using the Monroe, the digit "1"
in the left row of the keyboard is locked down throughout the
securing of S X², the value S X will appear in the left of the
register. This either saves the adding of X as a separate operation,
or partially checks the extensions in securing S X².


Table 1

Record :    Independent, Y      :     Dependent, X       :       :       :
No.    :   Y  :  My  :  y=Y-My  :   X  :  Mx  :  x=X-Mx  :   y²  :   x²  :   xy
-------:------:------:----------:------:------:----------:-------:-------:-------
   1   :  18  :  19  :    -1    :  10  :  10  :     0    :    1  :    0  :     0
   2   :  18  :  19  :    -1    :  10  :  10  :     0    :    1  :    0  :     0
   3   :  17  :  19  :    -2    :  11  :  10  :    +1    :    4  :    1  :    -2
   4   :  19  :  19  :     0    :  11  :  10  :    +1    :    0  :    1  :     0
   5   :  21  :  19  :    +2    :   9  :  10  :    -1    :    4  :    1  :    -2
   6   :  20  :  19  :    +1    :   8  :  10  :    -2    :    1  :    4  :    -2
   7   :  16  :  19  :    -3    :  12  :  10  :    +2    :    9  :    4  :    -6
   8   :  20  :  19  :    +1    :   8  :  10  :    -2    :    1  :    4  :    -2
   9   :  22  :  19  :    +3    :   8  :  10  :    -2    :    9  :    4  :    -6
  10   :  20  :  19  :    +1    :  10  :  10  :     0    :    1  :    0  :     0
  11   :  19  :  19  :     0    :   9  :  10  :    -1    :    0  :    1  :     0
  12   :  23  :  19  :    +4    :   7  :  10  :    -3    :   16  :    9  :   -12
  13   :   5  :  19  :   -14    :  23  :  10  :   +13    :  196  :  169  :  -182
  14   :  21  :  19  :    +2    :   7  :  10  :    -3    :    4  :    9  :    -6
  15   :  18  :  19  :    -1    :  12  :  10  :    +2    :    1  :    4  :    -2
  16   :  24  :  19  :    +5    :   6  :  10  :    -4    :   25  :   16  :   -20
  17   :  20  :  19  :    +1    :  10  :  10  :     0    :    1  :    0  :     0
  18   :  19  :  19  :     0    :   9  :  10  :    -1    :    0  :    1  :     0
  19   :  19  :  19  :     0    :  11  :  10  :    +1    :    0  :    1  :     0
  20   :  21  :  19  :    +2    :   9  :  10  :    -1    :    4  :    1  :    -2
-------:------:------:----------:------:------:----------:-------:-------:-------
 n=20  : 380  :      :     0    : 200  :      :     0    :  278  :  230  :  -244
 Means :  19  :      :     0    :  10  :      :     0    :  13.9 :  11.5 :  -12.2

Standard deviations squared (means of y² and x²):  13.9 for Y, 11.5 for X

Product moment (mean of xy):  -12.2
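
The figures of Table 1 were secured by taking deviations. As a check, the same results can be reached from the original values alone by formulas (29) and (31), cumulating the sums of squares and products in a single pass much as a machine register would. The sketch below is written in Python under that assumption; the function name `cumulate` is illustrative only, and the X and Y lists are the twenty records of Table 1.

```python
from math import sqrt

# The twenty records of Table 1 (Y independent, X dependent).
Y = [18, 18, 17, 19, 21, 20, 16, 20, 22, 20,
     19, 23, 5, 21, 18, 24, 20, 19, 19, 21]
X = [10, 10, 11, 11, 9, 8, 12, 8, 8, 10,
     9, 7, 23, 7, 12, 6, 10, 9, 11, 9]


def cumulate(xs, ys):
    """Accumulate n, SX, SY, SX^2, SY^2 and SXY without listing them."""
    n = sx = sy = sxx = syy = sxy = 0
    for x, y in zip(xs, ys):
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return n, sx, sy, sxx, syy, sxy


n, SX, SY, SXX, SYY, SXY = cumulate(X, Y)
Mx, My = SX / n, SY / n

var_x = SXX / n - Mx ** 2      # formula (29), squared
var_y = SYY / n - My ** 2
p_xy = SXY / n - Mx * My       # formula (31)
r_xy = p_xy / sqrt(var_x * var_y)

print(var_x, var_y, p_xy, r_xy)
```

The squared standard deviations and the product moment so obtained should match the means of x², y² and xy in the table.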


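Table 2.  Double frequency table

[The body of this table is not legible in this copy. The column classes
run 300-304, 305-309, 310-314, 315-319, 320-324, 325-329, 330-334,
335-339, 340-344 and 345-; assumed values are entered beside the class
designations, and the frequencies are summed for each row and each
column. Legible column sums include 64, 105, 159, 132 and 76.]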

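TABLE 4.  COMPUTATION OF σy

Y values     :        :  Frequency  :          :
(assumed) ȳ  :   ȳ²   :      f      :    fȳ    :   fȳ²
-------------:--------:-------------:----------:---------
    -5       :   25   :       2     :   -10    :     50
    -4       :   16   :       3     :   -12    :     48
    -3       :    9   :      17     :   -51    :    153
    -2       :    4   :      64     :  -128    :    256
    -1       :    1   :     105     :  -105    :    105
     0       :    0   :     159     :     0    :      0
    +1       :    1   :     132     :  +132    :    132
    +2       :    4   :      76     :  +152    :    304
    +3       :    9   :      23     :   +69    :    207
    +4       :   16   :       3     :   +12    :     48
-------------:--------:-------------:----------:---------
Sums         :        :     584     :   +59    :   1303
Means        :        :             :  +.101   :   2.23
M̄y squared   :        :             :          :   -.01
σ̄y²          :        :             :          :   2.22
Square root, σ̄y       :             :          :  1.490
σy = cy σ̄y = 5 x 1.490              :          :  7.450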


TABLE 3.  COMPUTATIONS OF σx

X values     :        :  Frequency  :          :
(assumed) x̄  :   x̄²   :      f      :    fx̄    :   fx̄²
-------------:--------:-------------:----------:---------
    +4       :   16   :       1     :    +4    :     16
    +3       :    9   :       6     :   +18    :     54
    +2       :    4   :      62     :  +124    :    248
    +1       :    1   :     150     :  +150    :    150
     0       :    0   :     137     :     0    :      0
    -1       :    1   :     135     :  -135    :    135
    -2       :    4   :      53     :  -106    :    212
    -3       :    9   :      36     :  -108    :    324
    -4       :   16   :       4     :   -16    :     64
    -5       :   25   :       0     :     0    :      0
-------------:--------:-------------:----------:---------
Sums         :        :     584     :   -69    :   1203
Means        :        :             :  -.1181  :   2.06
M̄x squared   :        :             :          :   -.01
σ̄x²          :        :             :          :   2.05
Square root, σ̄x       :             :          :  1.432
σx = cx σ̄x = 2 x 1.432              :          :  2.864

TABLE 5.  COMPUTATION OF p

[The body of this table, which lists for each compartment of table 2 the
assumed values x̄ and ȳ, their product x̄ȳ, the frequency f, and the
extension f x̄ȳ, is not legible in this copy. The summary lines follow.]

Sum of f x̄ȳ                                    :   -716
Divide by n (584) gives mean                   :  -1.226
Subtract M̄x M̄y, (-.118)(.101)                  :   -.012
p̄xy                                            :  -1.214
pxy = cx cy p̄xy = -1.214 x (5 x 2)             :  -12.14


TABLE 6.  COMPUTATION OF MEAN OF COLUMNS

Column     :        :         :
(type)     :    n   :   S x̄   :  m = S x̄/n
-----------:--------:---------:------------
ȳ = -5     :     2  :    +4   :   +2.00
ȳ = -4     :     3  :    +3   :   +1.00
ȳ = -3     :    17  :   +16   :    +.94
ȳ = -2     :    64  :   +58   :    +.91
ȳ = -1     :   105  :   +84   :    +.80
ȳ =  0     :   159  :   +41   :    +.26
ȳ = +1     :   132  :  -132   :   -1.00
ȳ = +2     :    76  :   -95   :   -1.25
ȳ = +3     :    23  :   -40   :   -1.74
ȳ = +4     :     3  :    -8   :   -2.67
-----------:--------:---------:------------
Check against S f x̄ in computation of σx:  -69;  mean = -.1181


TABLE 7.  COMPUTATION OF σm AND Nxy

From table 6     :         :
    n     :   m     :   m²    :   n m²
----------:---------:---------:----------
     2    :  2.00   :  4.00   :    8.00
     3    :  1.00   :  1.00   :    3.00
    17    :   .94   :   .86   :   14.62
    64    :   .91   :   .83   :   53.12
   105    :   .80   :   .64   :   67.20
   159    :   .26   :   .68   :  108.12
   132    :  1.00   :  1.00   :  132.00
    76    :  1.25   :  1.56   :  118.56
    23    :  1.74   :  3.02   :   69.46
     3    :  2.67   :  7.13   :   21.39
----------:---------:---------:----------
584 = n   :         :         :  595.47
Mean                          :    1.02
Subtract M̄x² [Table 5], (-.1181)²  :  -.01
σm²                           :    1.01
σm                            :   1.005

Nxy = σm/σx = 1.005/1.432 = .70



Double  Frequency  Table. 

When  there  are  a  considerable  number  of  observations,  there  are  pro- 
bably a  large  number  of  repetitions  of  given  values.    The  process  of 
squaring  and  multiplying  could  be  facilitated  if  these  like  values  were 
grouped  together.    It  is  much  simpler  to  square  347,  for  example,  and 
multiply  the  square  times  the  number  of  times  it  occurs,  than  to  square 
347  each  time  it  appears  and  add  the  resultants.    For  those  who  have  access 
to  punch  card  machines,  the  process  of  securing  this  grouping  is  very  sim- 
ple and  needs  no  explanation  at  this  point.    For  those  who  do  not  have 
access  to  such  machines,  or  to  computing  machines,  the  double  frequency 
table  is  practically  invaluable  in  its  labor  saving. 

A  double  frequency  or  "correlation"  table  is  simply  a  grouping  of 
items  into  classes  by  one  variable  with  each  group  reclassified  by  the 
second  variable.    In  planning  such  a  table  care  should  be  taken  to  insure 
that  the  class  interval  be  the  same  throughout  the  classes  of  each  variable. 
The value of any item in a class is customarily taken to be the average
of the limits of that class. The nature of a double frequency table can
be easily grasped by inspecting table 2.

In constructing a frequency table, the columns and rows are first
properly titled. The operator then notes in which vertical grouping,
X class, each observation lies, and then moves horizontally across the line
to the proper Y class, making a tally mark in the compartment. When all
observations have been properly tallied, the number of tallies in each
compartment is written in. The numbers appearing in the body of a finished
double frequency table therefore represent the number of observations of
a given X value which occurred in conjunction with the given Y value.

Note (in Table 2) that the class values for X are listed in descending
magnitude. This, though not usual, is done for the purpose of bringing
the tabular scales into conformity with the scales used on graphs such
as shown in figure 1. Indeed, a rough form of graph can be superimposed
on the table when the scales are so arranged.

The  vertical  distributions  are  called  columns,  the  horizontal,  rows. 
Since  it  makes  no  difference  statistically  which  variable  runs  vertically 
or  horizontally,  either  row  or  column  may  be  called  an  "array".    An  array 
is  adequately  designated  by  the  value  of  the  class,  either  X  or  Y  as  the 
case  may  be,  in  which  its  entire  distribution  lies.    This  class  value  is 
termed  the  "type"  of  the  array. 

Also  note  that  space  is  provided  adjacent  to  the  class  designations 
for  "assumed  values."    These  assumed  values  assist  very  materially  in  the 
labor  of  computation. 

Using  the  assumed  values,  the  standard  deviations  can  be  very  quickly 
determined: 

(1) Multiply each of the summed frequencies found in the right hand
column (table 2) by its corresponding assumed value of X and sum the
products. Dividing by the total number of cases evidently gives the mean
of the X values, $S(\bar{X})/n = \bar{M}_x$, in terms of assumed values.

(2) Repeat, except instead of using the assumed values of X, use
the assumed values squared. Secure the mean of the summed products,
$S(\bar{X}^2)/n$.

(3) From the mean of the squares so secured it is only necessary
to subtract the square of the mean, $\bar{M}_x$, found in "(1)" above, to
secure the squared standard deviation of the X series in terms of the
assumed values, i.e.

$$\bar{\sigma}_x^2 = \frac{S\bar{X}^2}{n} - \bar{M}_x^2$$

(4) Multiplying $\bar{\sigma}_x$ by the class interval yields the true
standard deviation of X:

$$\sigma_x = c_x \bar{\sigma}_x$$

An analogous procedure secures the standard deviation of the Y
values. These computations are illustrated in table 3 and table 4 for
the data given in table 2.
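
The four steps above can be sketched as a short Python routine. The assumed values and frequencies are those read from table 3, and the class interval of 2 is the one used there; the variable names are illustrative and not from the text. Because the script carries more decimals than the table, its result may differ from the tabulated figure in the third decimal place.

```python
from math import sqrt

# Assumed values of X and their summed frequencies, as read from table 3.
assumed_x = [4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
freq = [1, 6, 62, 150, 137, 135, 53, 36, 4, 0]
class_interval_x = 2

n = sum(freq)

# Step (1): mean of X in terms of assumed values.
sum_fx = sum(f * x for f, x in zip(freq, assumed_x))
mean_assumed = sum_fx / n

# Step (2): mean of the squared assumed values.
sum_fx2 = sum(f * x * x for f, x in zip(freq, assumed_x))
mean_square = sum_fx2 / n

# Step (3): squared standard deviation in assumed units.
var_assumed = mean_square - mean_assumed ** 2

# Step (4): convert to original units with the class interval.
sd_assumed = sqrt(var_assumed)
sd_true = class_interval_x * sd_assumed

print(n, sum_fx, sum_fx2, round(sd_assumed, 3), round(sd_true, 3))
```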

To secure the product moment:

(1) A series of products, $f\bar{x}\bar{y}$, are secured, wherein
$\bar{x}\bar{y}$ is the product of the corresponding X and Y assumed values
for any compartment of the table, and f is the number of observations
listed in the compartment. This is simply securing the product of each
associated X and Y value (in terms of assumed values) and weighting each
different product by the number of times that particular combination
occurs.

(2) Subtract from the mean of such products, $S(f\bar{x}\bar{y})/n$, the
product of the two means, $\bar{M}_x\bar{M}_y$, found when securing the
standard deviations; the difference is the product moment,
$\bar{p}_{xy}$, in terms of the assumed values, i.e.

$$\bar{p}_{xy} = \frac{S(f\bar{x}\bar{y})}{n} - \bar{M}_x\bar{M}_y$$

(3) Multiplying $\bar{p}_{xy}$ by the product of the class intervals of
X and Y, $c_x c_y$, gives the product moment in terms of original values,
i.e.

$$p_{xy} = c_x c_y \bar{p}_{xy}$$

Having the values of the product moment and the two standard
deviations, it is possible to secure immediately the coefficient of
correlation, $r_{xy} = p_{xy}/\sigma_x\sigma_y$. Table 6 shows the
computations.

An algebraic demonstration of the authenticity of using assumed
values is given below:

Each assumed value of X, $\bar{X}$, is equal to the original value of X,
less the value of X corresponding to the zero on the assumed scale, $V_x$,
divided by the class interval, i.e.,

$$\bar{X} = (X - V_x)/c_x$$

Hence

$$X = c_x\bar{X} + V_x$$

The mean of X, $M_x$, is therefore equal to the class interval times
the mean of the assumed values, plus the value of X in the assumed zero
class:

$$M_x = \frac{SX}{n} = \frac{c_x S\bar{X} + nV_x}{n} = c_x\bar{M}_x + V_x$$

Substituting the above values of each X and the mean of X below
in the proper place, the product moment may be derived as follows:

$$
\begin{aligned}
\text{Each deviation in X, } x &= X - M_x \\
  &= c_x\bar{X} + V_x - c_x\bar{M}_x - V_x \\
  &= c_x(\bar{X} - \bar{M}_x)
\end{aligned}
$$

Similarly

$$y = c_y(\bar{Y} - \bar{M}_y)$$

Therefore, each

$$xy = c_x c_y(\bar{X}\bar{Y} - \bar{X}\bar{M}_y - \bar{Y}\bar{M}_x + \bar{M}_x\bar{M}_y)$$

Summing and dividing by n

$$\frac{Sxy}{n} = p_{xy} = c_x c_y\left(\frac{S\bar{X}\bar{Y}}{n} - \bar{M}_x\bar{M}_y\right) = c_x c_y\,\bar{p}_{xy}$$

Standard deviations as follows: each $x^2 = c_x^2(\bar{X} - \bar{M}_x)^2$.
Summing and dividing

$$\sigma_x^2 = \frac{Sx^2}{n} = c_x^2\left(\frac{S\bar{X}^2}{n} - 2\bar{M}_x\frac{S\bar{X}}{n} + \bar{M}_x^2\right) = c_x^2\,\bar{\sigma}_x^2$$

Similarly

$$\sigma_y^2 = c_y^2\,\bar{\sigma}_y^2$$
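
The relations just demonstrated can be verified numerically. The sketch below, in Python, uses six hypothetical pairs of values, with class intervals of 2 and 5 and assumed zeros of 16 and 312 chosen only for illustration. It checks that the product moment scales by the product of the class intervals, that each standard deviation scales by its own class interval, and that the correlation coefficient is the same in assumed and original values.

```python
from math import sqrt, isclose

# Hypothetical original values (class midpoints), for illustration only.
X = [12, 14, 14, 16, 18, 20]
Y = [302, 312, 307, 317, 317, 327]
c_x, V_x = 2, 16     # class interval and assumed zero for X (hypothetical)
c_y, V_y = 5, 312    # class interval and assumed zero for Y (hypothetical)


def mean(values):
    return sum(values) / len(values)


def std_dev(values):
    """Formula (29): sqrt(SX^2/n - Mx^2)."""
    return sqrt(sum(v * v for v in values) / len(values) - mean(values) ** 2)


def product_moment(xs, ys):
    """Formula (31): SXY/n - Mx*My."""
    return sum(x * y for x, y in zip(xs, ys)) / len(xs) - mean(xs) * mean(ys)


# Assumed values: (original - assumed zero) / class interval.
X_bar = [(x - V_x) / c_x for x in X]
Y_bar = [(y - V_y) / c_y for y in Y]

p, p_bar = product_moment(X, Y), product_moment(X_bar, Y_bar)
sx, sx_bar = std_dev(X), std_dev(X_bar)
sy, sy_bar = std_dev(Y), std_dev(Y_bar)
r, r_bar = p / (sx * sy), p_bar / (sx_bar * sy_bar)

print(p, c_x * c_y * p_bar)
print(r, r_bar)
```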

III.  THE CORRELATION RATIO.
Using the assumed values throughout as though they were the only
known values of X and Y, find the value of $b_{xy}$. By formula (9),
*>xy    =  Pxy/    OJ3  =  --1.214/3.05    ±  ..,§92 
The regression line must, on a graph, pass through the intersection
of the means of x and y. Using the correlation table and its scales as
a rough form of graph, the line may be plotted. See table 2.

The correlation coefficient was defined as the relation between
the standard deviation of the X values if they had all been on the line
and their actual standard deviation, $\sigma'_x/\sigma_x$. Put in another
way, the correlation between X and Y is a function of the scatter of the
X values around the regression line compared to their scatter around
their mean line. The less the scatter around the regression line, the
better the correlation, and the more dependable the line as a method of
estimating the dependent from its described relation to the independent.
But perhaps a curved line could be put on the graph which would come more
closely to fitting all the X values than does the straight line. As a
method of ascertaining if this is so, let the mean value of X for each
column be secured and indicated by a small circle. (See table 6 for
computation. The averages are connected with a broken line.)

Inspection of the curve of these averages indicates that a closer
approximation to the true values of X would be secured if this curve were
used instead of the $b_{xy}$ line.

Bearing in mind the significance of the scatter as a measure of
the relation, just as $r_{xy}$ may be defined as
$\sqrt{1 - \sigma_z^2/\sigma_x^2}$ where the residuals are measured as
deviations from the regression line, we may in analogous fashion define
$N_{xy}$ (eta) as $\sqrt{1 - \sigma_z^2/\sigma_x^2}$ when the residuals
are measured as deviations from the averages of the arrays, thus
obtaining the measure of relation from the scatter around the average
line instead of around the regression line.

$$N_{xy} = \sqrt{1 - \sigma_z^2/\sigma_x^2} \tag{33}$$

$N_{xy}$ is called the "Correlation Ratio" as distinguished from
the correlation coefficient. It may be calculated very simply from the
relation that may be shown to exist between $\sigma_z$, $\sigma_x$ and
$\sigma_m$, the standard deviation of the means of the columns of X
values weighted upon the number of observations in each group:

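The total squared deviation from $m_1$ in the first group (column)
of $n_1$ observations, $S(d_1^2)$, where $m_1$ is the value of the mean
of the X values for the group, may be written

$$
\begin{aligned}
Sd_1^2 &= S(x_1 - m_1)^2 \\
       &= Sx_1^2 - 2m_1 Sx_1 + n_1 m_1^2 \\
       &= Sx_1^2 - 2m_1^2 n_1 + n_1 m_1^2 \\
       &= Sx_1^2 - m_1^2 n_1
\end{aligned}
\tag{34}
$$

For all columns $S(d^2)$ becomes $S(z^2)$ and we may write:

$$
\begin{aligned}
S(z^2) &= (Sx_1^2 - m_1^2 n_1) + (Sx_2^2 - m_2^2 n_2) + \dots + (Sx_n^2 - m_n^2 n_n) \\
       &= S(x^2) - S(m^2 n)
\end{aligned}
\tag{35}
$$

wherein $S(m^2 n)$ represents the sum of terms like $m_1^2 n_1$. Dividing
by n,

$$\frac{S(z^2)}{n} = \frac{S(x^2)}{n} - \frac{S(m^2 n)}{n} \qquad \text{or} \qquad \sigma_z^2 = \sigma_x^2 - \sigma_m^2 \tag{36}$$

Hence,

$$1 - \frac{\sigma_z^2}{\sigma_x^2} = \frac{\sigma_m^2}{\sigma_x^2} \tag{37}$$

and

$$N_{xy} = \sigma_m/\sigma_x \tag{38}$$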
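
Relation (36) and the equivalence of formulas (33) and (38) can be checked on a small example. The Python sketch below uses three hypothetical columns of X values (the dictionary `arrays` and its contents are illustrative only). It measures the scatter around the column means directly and compares it with the difference between the total and the between-column squared standard deviations.

```python
from math import sqrt, isclose

# Hypothetical columns: type of array -> X values falling in that column.
arrays = {
    -1: [4, 5, 6],
    0: [3, 4, 4, 5],
    1: [1, 2, 3],
}

all_x = [x for column in arrays.values() for x in column]
n = len(all_x)
grand_mean = sum(all_x) / n

# Total squared standard deviation of X.
var_x = sum((x - grand_mean) ** 2 for x in all_x) / n

# Squared standard deviation of the column means, weighted by column size.
var_m = sum(
    len(col) * (sum(col) / len(col) - grand_mean) ** 2
    for col in arrays.values()
) / n

# Mean squared residual around the column means.
var_z = sum(
    (x - sum(col) / len(col)) ** 2
    for col in arrays.values()
    for x in col
) / n

eta_from_means = sqrt(var_m / var_x)          # formula (38)
eta_from_scatter = sqrt(1 - var_z / var_x)    # formula (33)

print(var_z, var_x - var_m)
print(eta_from_means, eta_from_scatter)
```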

To secure the correlation ratio, $N_{xy}$, then, it is only necessary
to find the standard deviation of the means of the X values for each
type, weighting each mean by the number of items in that array.

Form for computation.
(For computations see table 6)

   Sx     :    n     :    m     :  nm² = Sx.m
----------:----------:----------:--------------
   Sx1    :    n1    :    m1    :    n1 m1²
   Sx2    :    n2    :    m2    :    n2 m2²
    .     :    .     :    .     :      .
   Sxn    :    nn    :    mn    :    nn mn²
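
The form above may be worked through in Python as a rough check. The sketch assumes the column counts and sums of table 6 and the sums of table 3 as read from this copy; the variable names are illustrative. Each product nm² is formed as Sx times m, so no rounded mean enters the work. Carried at full precision, the sketch gives a ratio of about .64 rather than the .70 of table 7; the difference appears to trace to the entry .68 for the square of .26 in that table.

```python
from math import sqrt

# Column counts and column sums of assumed X values, as read from table 6.
n_col = [2, 3, 17, 64, 105, 159, 132, 76, 23, 3]
sx_col = [4, 3, 16, 58, 84, 41, -132, -95, -40, -8]

# Sum of f * x^2 over all observations, as read from table 3.
SUM_FX2 = 1203

n = sum(n_col)
grand_mean = sum(sx_col) / n

# n * m^2 for each column, computed as Sx * m with m = Sx / n.
nm2 = [sx * (sx / k) for sx, k in zip(sx_col, n_col)]

var_m = sum(nm2) / n - grand_mean ** 2   # squared s.d. of column means
var_x = SUM_FX2 / n - grand_mean ** 2    # squared s.d. of X (assumed units)

eta = sqrt(var_m / var_x)                # formula (38)

print(n, sum(sx_col), round(sqrt(var_m), 3), round(eta, 2))
```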


No matter which of the two variables is considered the dependent,
the correlation coefficient is the same; this is not necessarily true
for the correlation ratio. The ratio $N_{xy}$ does not necessarily equal
$N_{yx}$.

The correlation ratio is never smaller than the correlation
coefficient, and is usually larger. If the means of the arrays lie along
a straight line, the ratio then obviously becomes the equivalent of the
correlation coefficient, since the scatter around the regression line and
the line of the means would be identical.

A marked difference between the ratio and the coefficient often
indicates that a straight line does not satisfactorily describe the
relation between X and Y. Since the ratio is derived from the formula
$N_{xy} = \sigma_m/\sigma_x = \sqrt{1 - \sigma_z^2/\sigma_x^2}$, it may be
either positive or negative in sign. It is customary to consider the
ratio as positive.

It is not possible to get a regression equation from the correlation
ratio that will fit all parts of the line of averages, because there
is no consistent relation between any two points of the curve. The
best that can be done is to describe the curve itself, the graph of the
line of averages, as the functional relation existing between dependent
and independent.

IV.     THE  CORRELATION  INDEX. 
If one should now, either by mathematical process, or free hand,
smooth the curve of averages derived by methods shown in the previous
section, on the basic and usually justifiable assumption that the effect
of gradual changes in the independent is a gradual change in the
dependent, a continuous change; and if one should further compute the
value of the "root-mean-square" residual from this curve and use this
value instead of the residuals from the means line in the formula for the
correlation ratio, $\sqrt{1 - \sigma_z^2/\sigma_x^2}$, a value would be
secured which is called the "Correlation Index" 5/ designated by $\rho$,
rho. This value has properties characterizing both the ratio and the
coefficient, but superior to both. A regression curve is obtained,
comparable to the straight regression line for the coefficient. Like the
ratio, there are always two index figures for each pair of variables;
these however tend to approach each other in value like the coefficient,
since changes in the magnitude of the indexes attributable to extreme
items are less probable than with the ratio, the smoothing permitting the
values in adjoining arrays to prevent the curve from following extreme
items in any one array. Often the theoretical considerations underlying a
problem predicate that curvilinear relations exist; the index affords a
method of describing quantitatively such relations. Like the correlation
ratio, the index is considered to be positive in sign. Similar to both
ratio and coefficient its value can never be greater than 1.0. Like the
coefficient the curve permits the estimating of values of the dependent
from the described relation to the independent.

5/ The correlation index is the most recent of the correlation measures.
Its formula was devised and the name "index" given it, independently by
Mordecai Ezekiel and F. C. Mills. See Ezekiel, Mordecai. A Method of
Handling Curvilinear Correlation for Any Number of Variables. Amer.
Statis. Assoc. Jour. 19: 431-453. 1924. Mills, F. C. The Measurement of
Correlation and the Problem of Estimation. Amer. Statis. Assoc. Jour.
19: 273-300. 1924.

The only satisfactory way to compute the index of correlation is
actually to measure the residuals and secure the standard error of estimate
in that fashion. However, in the event that the curve used is some
mathematical curve fitted by least squares, there are short-cuts which may
be used and will be described when Multiple Correlation is discussed.
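
Measuring the residuals directly, as just described, can be sketched as follows in Python. The six observations and the smooth curve (here simply the square of the independent) are hypothetical and stand in for whatever free-hand or mathematical curve the investigator has drawn; the function name `smoothed_curve` is illustrative only.

```python
from math import sqrt

# Hypothetical observations: independent Y and dependent X.
Y = [1, 2, 3, 4, 5, 6]
X = [1.1, 3.9, 9.2, 15.8, 25.1, 36.2]


def smoothed_curve(y):
    """Hypothetical smoothed curve giving the estimate of X for a given Y."""
    return y ** 2


n = len(X)
mean_x = sum(X) / n

# Squared standard deviation of the dependent.
var_x = sum((x - mean_x) ** 2 for x in X) / n

# Mean square of the residuals from the curve.
residuals = [x - smoothed_curve(y) for x, y in zip(X, Y)]
var_z = sum(z * z for z in residuals) / n

# Correlation index: same form as the ratio, residuals from the curve.
rho = sqrt(1 - var_z / var_x)

print(round(sqrt(var_z), 3), round(rho, 4))
```

With a curve that fits poorly, the mean square residual can exceed the squared standard deviation of the dependent, in which case the quantity under the root is negative and no index can be secured from that curve.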


It is also possible to secure $\sigma_z$ if the difference between the
curve (smoothed) value and the array mean is known for each array. Compute
the root-mean-square of these differences, weighting each difference by
the number of items in the array, similar to the manner in which
$\sigma_m$ was computed, and denote it by $\sigma_d$. Then
$\sigma_z^2 = \sigma_x^2 - \sigma_m^2 + \sigma_d^2$.


TABLE 8 - ASSEMBLY OF SEVERAL MEASURES.

         :  Formula (not all formulas given)          :  Value in example (Table 2)
Measure  :  In terms of        :  Original in terms   :  For assumed  :  For original
         :  original values    :  of assumed values   :  values       :  values
---------:---------------------:----------------------:---------------:---------------
rxy      :  pxy/σxσy           :  r̄xy                 :    -.569      :    -.569
dxy      :  r²xy               :                      :     .324      :     .324
σz       :  σx √(1 - r²xy)     :                      :    1.18       :    2.36
bxy      :  rxy σx/σy          :  b̄xy cx/cy           :    -.547      :    -.219
σx       :  √(SX²/n - Mx²)     :  cx σ̄x               :    1.432      :    2.864
σ'x      :  (bxy σy)           :                      :     .815      :    1.630
pxy      :  SXY/n - MxMy       :  p̄xy cx cy           :   -1.214      :   -12.14
Nxy      :  √(1 - σz²/σx²)     :  σm/σx               :     .70       :
ρ        :  Same as Nxy except residuals measured     :               :
         :  from smoothed curve                       :               :
σy       :  Similar to σx      :                      :    1.490      :    7.450

V.  Multiple Linear Correlation 4/
(1) Regression

The  preceding  sections  have  "been  devoted  to  methods  of  measuring 
the  relation  of  one  (independent)  variable  to  the  dependent  variable, 
and  the  reliability  or  constancy  of  that  relation.    The  following  sections 
will  be  devoted  to  methods  of  measuring  the  relation  of  several  independ- 
ent variables  to  the  dependent  variable  and  the  reliability  or  constancy 
of  the  several  relations.     Since  there  are  several  independents  instead 
of  but  one,  a  further  problem  is  automatically  introduced,  that  of  de- 
termining the  relative  significance  of  the  several  relationships  discovered. 

In  most  analyses  of  problems  it  will  develop  that  several  factors 
influence  the  given,  dependent  factor.    For  example,  there  are  numerous 
factors  which  influence  the  price  of  a  commodity.    For  convenience 
it may be said, then, that the dependent variable (price) is some function
of the other variables (supply, demand, price level, etc.). This may
be symbolized

$$X = F(A, B, C, \dots)$$

wherein X denotes the dependent and the initial letters in the alphabet
the independents, and F means "function of". The problem is to
determine the nature of the function. The gross correlation methods
enabled us to determine the best values of the necessary constants when
it was assumed that the functional relationship was essentially linear.
In like manner, if we assume that the nature of the relationships of the
dependent to the several independents is linear, multiple correlation
methods, by a simple extension of gross correlation methods, permit us
to determine the best values of the necessary constants. This assumption
of linearity may be given algebraic expression by writing

$$x = \underline{b}_1 a + \underline{b}_2 b + \underline{b}_3 c + \dots \tag{39}$$

wherein x, a, b, c represent deviations from average in X, A, B, C.
Deviations are employed, for, as in simple correlation, there are
advantages in comparing deviations from average rather than original
values. Formula (39), comparable with formula (3), implies not only a
linear relation between the dependent and each independent, but also that
the components of the independents are added together before being
equated to x. This may be a quite inappropriate assumption in some cases.
Products rather than sums might be a more valid type of functional
relationship to assume in certain cases. Nevertheless, linear multiple
correlation method is incapable of comprehending any other type of
relationship than that shown in formula (39). It is true that by using
logarithms or reciprocals or some other functions of the different
variables included, a certain amount of elasticity in this last
assumption may be obtained. Nevertheless, once these logarithms or
reciprocals or other functions are determined upon, linear correlation
method can do no more than provide the best values in formula (39), no
matter how appropriate or inappropriate the type of relationship therein
assumed may be. Evidently multiple correlation method provides means for
testing within but a narrow range that which the analysis of the problem
may evolve, i.e. that the dependent is some function of several other
factors. These limitations must be borne in mind in any interpretation of
multiple correlation results.

4/ For the original presentation of the least square approach to multiple
correlation see Tolley, H. R., and Ezekiel, Mordecai. A Method of Handling
Multiple Correlation Problems. Amer. Statis. Assoc. Jour. 18: 993-1003.
1923.

Suppose a dependent X, with deviations from mean represented by x;
and similarly deviations in A, B, C, etc. of a, b, c, etc.
Then the constants $\underline{b}$ in formula (39) may be obtained by an
extension of the process wherein b was found for formula (3).

In the case of gross regression, formula (3), the value of b was
found by forming a normal equation and solving. In the present case
the several values of $\underline{b}$ may be found by forming several
normal equations in analogous fashion, there being as many normal
equations as unknown values, $\underline{b}$, and then solving these
normal equations by any simultaneous method the investigator cares to use.

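The first normal equation is obtained by multiplying each observation
equation, formula (39), through by the coefficient of the first
unknown constant and summing the product equations so secured, i.e.,

$$
\begin{aligned}
a_1\,(x_1 = \underline{b}_1 a_1 + \underline{b}_2 b_1 + \underline{b}_3 c_1) &\;\longrightarrow\; (a_1 x_1 = \underline{b}_1 a_1^2 + \underline{b}_2 a_1 b_1 + \underline{b}_3 a_1 c_1) \\
a_2\,(x_2 = \underline{b}_1 a_2 + \underline{b}_2 b_2 + \underline{b}_3 c_2) &\;\longrightarrow\; (a_2 x_2 = \underline{b}_1 a_2^2 + \underline{b}_2 a_2 b_2 + \underline{b}_3 a_2 c_2) \\
&\;\;\vdots \\
a_n\,(x_n = \underline{b}_1 a_n + \underline{b}_2 b_n + \underline{b}_3 c_n) &\;\longrightarrow\; (a_n x_n = \underline{b}_1 a_n^2 + \underline{b}_2 a_n b_n + \underline{b}_3 a_n c_n)
\end{aligned}
$$

Summing gives

Normal Equation I.

$$
S(ax) = \underline{b}_1\,S(a^2) + \underline{b}_2\,S(ab) + \underline{b}_3\,S(ac)
$$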


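Multiplying each observation equation through by the coefficient of the
second unknown, $\underline{b}_2$, and summing the product equations gives the second
normal equation, i.e.,

$$
\begin{aligned}
b_1\,(x_1 = \underline{b}_1 a_1 + \underline{b}_2 b_1 + \underline{b}_3 c_1) &\;\longrightarrow\; (b_1 x_1 = \underline{b}_1 a_1 b_1 + \underline{b}_2 b_1^2 + \underline{b}_3 b_1 c_1) \\
&\;\;\vdots \\
b_n\,(x_n = \underline{b}_1 a_n + \underline{b}_2 b_n + \underline{b}_3 c_n) &\;\longrightarrow\; (b_n x_n = \underline{b}_1 a_n b_n + \underline{b}_2 b_n^2 + \underline{b}_3 b_n c_n)
\end{aligned}
$$

Normal Equation II.

$$
S(bx) = \underline{b}_1\,S(ab) + \underline{b}_2\,S(b^2) + \underline{b}_3\,S(bc)
$$
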
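In a similar manner normal equations are constructed for each of
the unknowns. In the case of three independents and three unknowns,
$\underline{b}_1$, $\underline{b}_2$ and $\underline{b}_3$, the normal equations are:

$$
\begin{aligned}
\text{I.}\quad & \underline{b}_1\,S(a^2) + \underline{b}_2\,S(ab) + \underline{b}_3\,S(ac) = S(ax) \\
\text{II.}\quad & \underline{b}_1\,S(ab) + \underline{b}_2\,S(b^2) + \underline{b}_3\,S(bc) = S(bx) \\
\text{III.}\quad & \underline{b}_1\,S(ac) + \underline{b}_2\,S(bc) + \underline{b}_3\,S(c^2) = S(cx)
\end{aligned}
\tag{40-}
$$

The absolute terms are given on the right of the equality signs
since arithmetically this arrangement represents a somewhat simpler solution.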

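If all terms are divided by n the coefficients of the unknowns in
the normal equations become familiar product moments and squared standard
deviations. Thus:

$$
\begin{aligned}
\text{I.}\quad & \underline{b}_1 \sigma_a^2 + \underline{b}_2\, p_{ab} + \underline{b}_3\, p_{ac} = p_{ax} \\
\text{II.}\quad & \underline{b}_1\, p_{ab} + \underline{b}_2 \sigma_b^2 + \underline{b}_3\, p_{bc} = p_{bx} \\
\text{III.}\quad & \underline{b}_1\, p_{ac} + \underline{b}_2\, p_{bc} + \underline{b}_3 \sigma_c^2 = p_{cx}
\end{aligned}
\tag{40-}
$$
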
An easy method of remembering these coefficients is to imagine a
box-like figure with columns and rows designated by the variables, as
given below:

             a      b      c      x

      a      aa     ab     ac     ax

      b      ba     bb     bc     bx

      c      ca     cb     cc     cx

      x      ax     bx     cx     xx


In each compartment the appropriate column and row letter designations are
listed, which then give the subscripts to the product moments. Subscripts
"aa", "bb", and "cc" evidently designate squared standard deviations. These
latter are styled "the diagonal terms". An inspection shows a symmetrical
distribution of coefficients around the diagonal.

Thus where the coefficient of $\underline{b}_2$ is $p_{ab}$ in the first row, second
column, the same value occurs in the second row, first column, as a
coefficient of $\underline{b}_1$. The symmetrical nature of these normal equations adapts
them to certain time-saving methods of solution; and it is eminently
worth while for the investigator to learn these special methods if he
anticipates the necessity of solving any number of such simultaneous
equations. They will be discussed briefly in the section dealing with
arithmetic methods. Any method of solving simultaneous equations, however, is
perfectly valid for determining the values of b.
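
The forming and solving of the three normal equations may be sketched in
Python as follows. This is only an illustrative sketch: the data are
hypothetical, with x built as an exact linear function of a, b and c so that
the solved coefficients can be checked, and the helper names (deviations, S,
solve) are assumptions introduced here. Ordinary Gaussian elimination is used,
not the special time-saving methods mentioned in the text.

```python
# Hypothetical observations; X is constructed as 2*A - 1*B + 0.5*C + 10,
# so the net regression coefficients should come out as 2.0, -1.0 and 0.5.
A = [1.0, 2.0, 4.0, 3.0, 6.0, 8.0]
B = [5.0, 3.0, 6.0, 2.0, 7.0, 1.0]
C = [2.0, 9.0, 4.0, 7.0, 1.0, 5.0]
X = [2.0 * ai - 1.0 * bi + 0.5 * ci + 10.0 for ai, bi, ci in zip(A, B, C)]


def deviations(values):
    """Deviations of each value from the mean of its series."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def S(u, v):
    """The summation sign of the text: sum of products of paired deviations."""
    return sum(ui * vi for ui, vi in zip(u, v))


def solve(matrix, rhs):
    """Solve simultaneous linear equations by Gaussian elimination."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Bring the largest remaining coefficient into the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][k] * solution[k] for k in range(r + 1, n))
        solution[r] = (m[r][n] - known) / m[r][r]
    return solution


a, b, c, x = (deviations(v) for v in (A, B, C, X))
independents = [a, b, c]

# Coefficients of the unknowns (the symmetrical "box") and the absolute terms.
coefficients = [[S(u, v) for v in independents] for u in independents]
absolute_terms = [S(u, x) for u in independents]

b1, b2, b3 = solve(coefficients, absolute_terms)
print(b1, b2, b3)
```

The list named coefficients is the box-like figure of the text without its x
row; its symmetry about the diagonal can be seen by comparing
coefficients[0][1] with coefficients[1][0].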

The significance of the values of b, when determined, is, as in the
case of gross correlation, that by the use of these values the sum of the
squared residuals will be a minimum; and by this criterion the values of b
will be the "best possible values". The residuals are found, as in the
case of simple correlation, by finding estimates of x, x', from the
regression equation and subtracting these estimates from the associated
values of x, i.e.,

$$
z = x - x'
$$



In the case of gross correlation, x' was determined by multiplying
the associated values of y by the determined value of b, that is x' = by,
in which y was the independent. In the case of multiple correlation,
an analogous procedure is followed. The value, x', is determined by
multiplying each associated value of the independents by the appropriate
determined values of b and adding, i.e.,

$$
x' = \underline{b}_1 a + \underline{b}_2 b + \underline{b}_3 c \tag{41}
$$

The residual, z, may then be expressed algebraically,

$$
z = x - x' = x - \underline{b}_1 a - \underline{b}_2 b - \underline{b}_3 c \tag{42}
$$

And by the theory of least squares the sum of the squared residuals, $S(z^2)$,
will be a minimum,

$$
S(x - \underline{b}_1 a - \underline{b}_2 b - \underline{b}_3 c)^2 = \text{minimum}.
$$
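
The minimum property may be checked numerically. The sketch below is an
illustration under stated assumptions: the observations are hypothetical, only
two independents (a and b) are used so that the two normal equations can be
solved by Cramer's rule, and the function names are introduced here for the
example only. It computes the sum of squared residuals at the least-squares
values and at eight nearby values of the coefficients.

```python
# Hypothetical observations of the dependent X and two independents A and B.
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
B = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
X = [0.5, 2.5, 2.0, 5.5, 3.5, 7.0]


def deviations(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def S(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def net_regressions(a, b, x):
    """Solve the two normal equations for b1 and b2 by Cramer's rule."""
    det = S(a, a) * S(b, b) - S(a, b) ** 2
    b1 = (S(a, x) * S(b, b) - S(b, x) * S(a, b)) / det
    b2 = (S(b, x) * S(a, a) - S(a, x) * S(a, b)) / det
    return b1, b2


a, b, x = deviations(A), deviations(B), deviations(X)
b1, b2 = net_regressions(a, b, x)


def sum_squared_residuals(c1, c2):
    """S(z^2) when c1 and c2 are used as the regression coefficients."""
    return sum((xi - c1 * ai - c2 * bi) ** 2 for ai, bi, xi in zip(a, b, x))


minimum = sum_squared_residuals(b1, b2)

# Shift each coefficient slightly up or down and recompute S(z^2).
shifts = [(d1, d2) for d1 in (-0.1, 0.0, 0.1) for d2 in (-0.1, 0.0, 0.1)
          if (d1, d2) != (0.0, 0.0)]
perturbed = [sum_squared_residuals(b1 + d1, b2 + d2) for d1, d2 in shifts]

print(minimum, min(perturbed))
```

Every perturbed sum exceeds the sum obtained with the solved coefficients,
which is the sense in which those values are the "best possible values".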

In gross correlation the value of b was termed "the regression
coefficient". In multiple correlation the values of b are termed "net
regression coefficients", the word "net" implying that more than one
independent variable is used, and that the effects of one or more other
variables are removed. And, just as in the case of gross correlation, the
values of b may be interpreted as indicating the amount of change
in x which is associated, on the average, with a unit change in the
given independent variable.

Since there are several independent variables, the representation
of b as the slope of a regression line on a single dot chart is slightly
more complicated than in the case of simple correlation. Nevertheless,
this representation can be made and is quite helpful. It is essential
to a comprehension of multiple curvilinear correlation, which will be
described later.



Before a dot chart and a regression line representing the effect of
a variable, a, on the dependent, x, can be constructed, the amount of
influence of the other variables, b and c, must be eliminated from x,
or else the true, or net, relation of a to x will be obscured. This
means that we cannot plot the values of x against the values of a in
making a dot chart, but that we must plot the values of x corrected for
the influence of the other independents - with the effect on x of these
other independents removed - against the values of a.

Since the amount of influence of b and c on x is given by the
net regressions of x on these two variables, the removal of this influence
may be accomplished for any instance of x by merely subtracting from x
the quantities $\underline{b}_2 b$ and $\underline{b}_3 c$ for associated values of b and c, i.e.,
x corrected for the influence of b and c is given by

$$
x - \underline{b}_2 b - \underline{b}_3 c = j \tag{43}
$$

If j, then, is plotted against the values of a, a dot chart will be
secured which will show graphically the net relation of a to x. And, if
on this dot chart a straight line be drawn to pass through as many of the
dots as possible - or rather, be drawn so that the summed squared
deviations from the line (in the "j" direction) will be a minimum - this line
will have a slope $\underline{b}_1$. It would be identical with the slope of a line
obtained from determining the regression of j on a by the method of
gross correlation.
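
The construction of j and the identity of the two slopes can be illustrated
with a short Python sketch. The data are hypothetical and, to keep the
arithmetic short, only two independents are assumed, so that j reduces to
x minus the contribution of b alone; the helper names are introduced for this
example only.

```python
# Hypothetical observations of the dependent X and two independents A and B.
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
B = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
X = [0.5, 2.5, 2.0, 5.5, 3.5, 7.0]


def deviations(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def S(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def net_regressions(a, b, x):
    """Solve the two normal equations for b1 and b2 by Cramer's rule."""
    det = S(a, a) * S(b, b) - S(a, b) ** 2
    b1 = (S(a, x) * S(b, b) - S(b, x) * S(a, b)) / det
    b2 = (S(b, x) * S(a, a) - S(a, x) * S(a, b)) / det
    return b1, b2


a, b, x = deviations(A), deviations(B), deviations(X)
b1, b2 = net_regressions(a, b, x)

# x corrected for the influence of b: the ordinates of the net dot chart.
j = [xi - b2 * bi for xi, bi in zip(x, b)]

# Gross regression of j on a, found exactly as in simple correlation.
gross_slope_of_j_on_a = S(a, j) / S(a, a)

print(b1, gross_slope_of_j_on_a)
```

Plotting the pairs (a, j) would give the dot chart described above; the two
printed slopes agree apart from rounding.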

The identity of the two may be easily demonstrated by showing that
any gross regression coefficient is such that the correlation between the
independent and the residuals is zero, and by then showing that the
correlation between $(j - \underline{b}_1 a = z)$ and a is also zero.


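Thus when b is determined in x' = by,

$$
b = \frac{S(xy)}{S(y^2)} \quad \text{and} \quad x' = y\,\frac{S(xy)}{S(y^2)}
$$

and

$$
z = x - x' = x - y\,\frac{S(xy)}{S(y^2)}
$$

Then

$$
\begin{aligned}
p_{yz} &= \frac{1}{n}\, S\!\left[ y \left( x - y\,\frac{S(xy)}{S(y^2)} \right) \right] \\
       &= \frac{1}{n} \left[ S(xy) - S(y^2)\,\frac{S(xy)}{S(y^2)} \right] \\
       &= 0
\end{aligned}
$$

No other value of b than $S(xy)/S(y^2)$ will give $p_{yz} = 0$. The correlation
between $(j - \underline{b}_1 a = z)$ and a may also be shown to equal zero:

Thus

$$
z = j - \underline{b}_1 a = x - \underline{b}_1 a - \underline{b}_2 b - \underline{b}_3 c
$$

Then

$$
\begin{aligned}
p_{az} &= \frac{1}{n}\, S\!\left[ a\,(x - \underline{b}_1 a - \underline{b}_2 b - \underline{b}_3 c) \right] \\
       &= \frac{1}{n} \left[ S(ax) - \underline{b}_1 S(a^2) - \underline{b}_2 S(ab) - \underline{b}_3 S(ac) \right] \\
       &= p_{ax} - \underline{b}_1 \sigma_a^2 - \underline{b}_2\, p_{ab} - \underline{b}_3\, p_{ac}
\end{aligned}
$$

But by the solution of the normal equations,

$$
\underline{b}_1 \sigma_a^2 + \underline{b}_2\, p_{ab} + \underline{b}_3\, p_{ac} = p_{ax}
$$

hence

$$
p_{az} = p_{ax} - p_{ax} = 0
$$

This is an important theorem to remember: Independent variables
are uncorrelated with the residuals; and only the least-square values of
b will produce this result. This proves the identity previously mentioned.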
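
A numerical check of the theorem is sketched below in Python. The
observations are hypothetical, two independents are assumed for brevity, and
the function names are introduced for the example only. The product moment of
each independent with the residuals is computed first with the least-squares
coefficients and then with one coefficient deliberately altered.

```python
# Hypothetical observations of the dependent X and two independents A and B.
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
B = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
X = [0.5, 2.5, 2.0, 5.5, 3.5, 7.0]


def deviations(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def S(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def net_regressions(a, b, x):
    """Solve the two normal equations for b1 and b2 by Cramer's rule."""
    det = S(a, a) * S(b, b) - S(a, b) ** 2
    b1 = (S(a, x) * S(b, b) - S(b, x) * S(a, b)) / det
    b2 = (S(b, x) * S(a, a) - S(a, x) * S(a, b)) / det
    return b1, b2


a, b, x = deviations(A), deviations(B), deviations(X)
b1, b2 = net_regressions(a, b, x)
n = len(x)


def product_moment_with_residuals(variable, c1, c2):
    """Product moment of a variable with z = x - c1*a - c2*b."""
    z = [xi - c1 * ai - c2 * bi for ai, bi, xi in zip(a, b, x)]
    return S(variable, z) / n


p_az = product_moment_with_residuals(a, b1, b2)
p_bz = product_moment_with_residuals(b, b1, b2)

# With any other coefficient the residuals remain correlated with a.
p_az_altered = product_moment_with_residuals(a, b1 + 0.1, b2)

print(p_az, p_bz, p_az_altered)
```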



The dot chart with ordinates of j and abscissae, a, having been
constructed, this chart then shows the net relation between the variable,
a, and x. In an exactly analogous manner, charts may be made to show
the net relation between x and the other independents, b and c, the
ordinates in one case being $(x - \underline{b}_1 a - \underline{b}_3 c)$ and in the other
$(x - \underline{b}_1 a - \underline{b}_2 b)$.

The sum of the squared residuals is, of course, the same for all
the charts, since a residual is in each case given by formula (42).

We have now developed methods for finding and representing
graphically the relationship of several independent variables to the dependent
upon assumptions implicit in formula (39).

(2)  Correlation 

The next proposition, as in the case of gross correlation, is to
develop some measure of the consistency or reliability of the
relationships discovered. As in that case, a logical procedure is to compare
the standard deviation of the residuals with the standard deviation of
the original x values. And, as in that case, some relation between
the $\sigma_x$, $\sigma_{x'}$ and $\sigma_z$ may be found to enable us to compute the measures
easily. As in the case of gross correlation,

$$
\sigma_x^2 = \sigma_{x'}^2 + \sigma_z^2
$$

is found to express the relationship between the three values, as follows:


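Each

$$
z = x - x'
$$

Each

$$
z^2 = (x - \underline{b}_1 a - \underline{b}_2 b - \underline{b}_3 c)^2
$$

Expanding, summing and dividing by n

$$
\begin{aligned}
\sigma_z^2 = \sigma_x^2 &- \underline{b}_1 p_{ax} - \underline{b}_2 p_{bx} - \underline{b}_3 p_{cx} \\
 &- \underline{b}_1 p_{ax} + \underline{b}_1^2 \sigma_a^2 + \underline{b}_1 \underline{b}_2\, p_{ab} + \underline{b}_1 \underline{b}_3\, p_{ac} \\
 &- \underline{b}_2 p_{bx} + \underline{b}_1 \underline{b}_2\, p_{ab} + \underline{b}_2^2 \sigma_b^2 + \underline{b}_2 \underline{b}_3\, p_{bc} \\
 &- \underline{b}_3 p_{cx} + \underline{b}_1 \underline{b}_3\, p_{ac} + \underline{b}_2 \underline{b}_3\, p_{bc} + \underline{b}_3^2 \sigma_c^2
\end{aligned}
$$

But taking the first normal equation and multiplying throughout by $\underline{b}_1$, we
have

$$
\underline{b}_1^2 \sigma_a^2 + \underline{b}_1 \underline{b}_2\, p_{ab} + \underline{b}_1 \underline{b}_3\, p_{ac} = \underline{b}_1 p_{ax} \tag{45}
$$

$\underline{b}_1 p_{ax}$ may therefore be substituted for the equivalent three terms in
the second row of the expanded square above. Substituting in like manner
for the equivalents of $\underline{b}_2 p_{bx}$ and $\underline{b}_3 p_{cx}$ from the second and third
normal equations, and collecting terms, (45) becomes

$$
\sigma_z^2 = \sigma_x^2 - (\underline{b}_1 p_{ax} + \underline{b}_2 p_{bx} + \underline{b}_3 p_{cx}) \tag{46}
$$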

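This is a perfectly general development applicable to any number of variables.
It now remains to be shown that the terms enclosed in the parentheses, formula
(46), are equal to $\sigma_{x'}^2$.

Each

$$
x' = \underline{b}_1 a + \underline{b}_2 b + \underline{b}_3 c
$$

and

$$
\begin{aligned}
\frac{S(x'^2)}{n} = \sigma_{x'}^2 = {} & \underline{b}_1^2 \sigma_a^2 + \underline{b}_1 \underline{b}_2\, p_{ab} + \underline{b}_1 \underline{b}_3\, p_{ac} \\
 & + \underline{b}_1 \underline{b}_2\, p_{ab} + \underline{b}_2^2 \sigma_b^2 + \underline{b}_2 \underline{b}_3\, p_{bc} \\
 & + \underline{b}_1 \underline{b}_3\, p_{ac} + \underline{b}_2 \underline{b}_3\, p_{bc} + \underline{b}_3^2 \sigma_c^2
\end{aligned}
\tag{47}
$$

$$
\begin{aligned}
\sigma_{x'}^2 = {} & \underline{b}_1 (\underline{b}_1 \sigma_a^2 + \underline{b}_2\, p_{ab} + \underline{b}_3\, p_{ac}) \\
 & + \underline{b}_2 (\underline{b}_1\, p_{ab} + \underline{b}_2 \sigma_b^2 + \underline{b}_3\, p_{bc}) \\
 & + \underline{b}_3 (\underline{b}_1\, p_{ac} + \underline{b}_2\, p_{bc} + \underline{b}_3 \sigma_c^2)
\end{aligned}
\tag{48}
$$

But the terms enclosed in the three parentheses in (48) are from the
normal equations respectively equal to $p_{ax}$, $p_{bx}$ and $p_{cx}$.

Hence

$$
\sigma_{x'}^2 = \underline{b}_1 p_{ax} + \underline{b}_2 p_{bx} + \underline{b}_3 p_{cx} \tag{49}
$$

Formula (49) is a very useful formula and should be remembered.

Substituting (49) in (46) we have

$$
\sigma_z^2 = \sigma_x^2 - \sigma_{x'}^2 \quad \text{or} \quad \sigma_x^2 = \sigma_{x'}^2 + \sigma_z^2 \tag{50}
$$

which is identical with formula (18).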
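
Formulas (49) and (50) can be verified on a small example. The Python sketch
below is illustrative only: the data are hypothetical, two independents are
assumed in place of three, and the variable names are chosen for this example.
It compares the squared standard deviation of the estimates with the sum of
coefficient-times-product-moment terms, and then checks the additive relation
between the three squared standard deviations.

```python
# Hypothetical observations of the dependent X and two independents A and B.
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
B = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
X = [0.5, 2.5, 2.0, 5.5, 3.5, 7.0]


def deviations(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def S(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def net_regressions(a, b, x):
    """Solve the two normal equations for b1 and b2 by Cramer's rule."""
    det = S(a, a) * S(b, b) - S(a, b) ** 2
    b1 = (S(a, x) * S(b, b) - S(b, x) * S(a, b)) / det
    b2 = (S(b, x) * S(a, a) - S(a, x) * S(a, b)) / det
    return b1, b2


a, b, x = deviations(A), deviations(B), deviations(X)
b1, b2 = net_regressions(a, b, x)
n = len(x)

estimates = [b1 * ai + b2 * bi for ai, bi in zip(a, b)]   # x'
residuals = [xi - ei for xi, ei in zip(x, estimates)]     # z

var_x = S(x, x) / n
var_estimates = S(estimates, estimates) / n
var_residuals = S(residuals, residuals) / n

# Two-independent form of formula (49): b1 * p_ax + b2 * p_bx.
formula_49 = b1 * S(a, x) / n + b2 * S(b, x) / n

print(var_estimates, formula_49)
print(var_x, var_estimates + var_residuals)
```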

Just as in the case of simple, or gross, correlation, a measure
of the closeness and reliability of the relationships discovered may be
had by expressing $\sigma_{x'}^2$ as a percentage or decimal fraction of $\sigma_x^2$,
and a coefficient of "alienation" by expressing $\sigma_z^2$ as a percentage, or
decimal fraction, of $\sigma_x^2$. The relationship between the two measures is,
of course,

$$
\frac{\sigma_{x'}^2}{\sigma_x^2} + \frac{\sigma_z^2}{\sigma_x^2} = 1
$$


If $\sigma_{x'}^2 / \sigma_x^2$ is reduced to the first order it becomes $\sigma_{x'} / \sigma_x$ and may be
designated $R_{x.abc}$, wherein the subscripts to the right of the point
designate the independent variables employed. $R_{x.abc}$ is classically
known as "the coefficient of multiple correlation", and literally
represents the decimal fraction that the standard deviation of estimates
is of the standard deviation of the original X values. $R_{x.abc}$ is also
numerically equivalent to the ordinary gross coefficient of correlation
between the original X values and the estimates, X'; i.e.,

$$
R_{x.abc} = r_{xx'} \tag{51}
$$

This is a useful concept in interpreting $R_{x.abc}$ and the equality between
the two may be proved easily, as follows:

$$
R_{x.abc} = \frac{\sigma_{x'}}{\sigma_x}
$$

and, by definition,

$$
r_{xx'} = \frac{p_{xx'}}{\sigma_x \sigma_{x'}}
$$

but

$$
\begin{aligned}
p_{xx'} &= \frac{1}{n}\, S\!\left[ x\,(\underline{b}_1 a + \underline{b}_2 b + \underline{b}_3 c) \right] \\
        &= \frac{1}{n} \left[ \underline{b}_1 S(xa) + \underline{b}_2 S(xb) + \underline{b}_3 S(xc) \right] \\
        &= \underline{b}_1 p_{xa} + \underline{b}_2 p_{xb} + \underline{b}_3 p_{xc}
\end{aligned}
\tag{52}
$$

But by comparing (52) with (49) it becomes apparent that

$$
p_{xx'} = \sigma_{x'}^2 \tag{53}
$$

Hence substituting for $p_{xx'}$,

$$
r_{xx'} = \frac{\sigma_{x'}^2}{\sigma_x \sigma_{x'}} = \frac{\sigma_{x'}}{\sigma_x} = R_{x.abc} \tag{54}
$$
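
The equality of the multiple coefficient and the gross correlation between x
and its estimates may be checked as in the following Python sketch. It is an
illustration only, using hypothetical data and two independents; the names
multiple_R and r_x_estimates are introduced for the example.

```python
import math

# Hypothetical observations of the dependent X and two independents A and B.
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
B = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
X = [0.5, 2.5, 2.0, 5.5, 3.5, 7.0]


def deviations(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def S(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def net_regressions(a, b, x):
    """Solve the two normal equations for b1 and b2 by Cramer's rule."""
    det = S(a, a) * S(b, b) - S(a, b) ** 2
    b1 = (S(a, x) * S(b, b) - S(b, x) * S(a, b)) / det
    b2 = (S(b, x) * S(a, a) - S(a, x) * S(a, b)) / det
    return b1, b2


a, b, x = deviations(A), deviations(B), deviations(X)
b1, b2 = net_regressions(a, b, x)
n = len(x)

estimates = [b1 * ai + b2 * bi for ai, bi in zip(a, b)]   # x'

sigma_x = math.sqrt(S(x, x) / n)
sigma_estimates = math.sqrt(S(estimates, estimates) / n)

# Multiple coefficient: ratio of the two standard deviations.
multiple_R = sigma_estimates / sigma_x

# Ordinary gross coefficient of correlation between x and x'.
r_x_estimates = (S(x, estimates) / n) / (sigma_x * sigma_estimates)

print(multiple_R, r_x_estimates)
```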



Formula (49) provides a simple and easy way of computing $R_{x.abc}$
once the constants in the multiple regression equation are determined.

$$
R^2_{x.abc} = \frac{\underline{b}_1 p_{ax} + \underline{b}_2 p_{bx} + \underline{b}_3 p_{cx}}{\sigma_x^2} \tag{55}
$$

The evaluation of formula (55) requires no other values than
those already provided by the solution of the normal equations. If $\sigma_z$
is known, a simple way of computing $R_{x.abc}$ is from the relationship

$$
R^2_{x.abc} = 1 - \frac{\sigma_z^2}{\sigma_x^2} \tag{56}
$$

This formula becomes of great importance in handling curvilinear multiple
correlation problems.

$R_{x.abc}$ is a coefficient of correlation which gives a measure of
the reliability of the various regression relationships discovered, when
taken as a group, or in another sense it is a measure of the reliability
of estimating the dependent from its discovered relationship to the
several independents. It does not, however, provide any means of
apportioning importance to the various independent variables. This is
the next problem before us.

(3)    Determination,  part  and  partial  correlation. 

Recalling the coefficients of determination developed in connection
with gross correlation, it was found that $(\sigma_{x'}^2 / \sigma_x^2 = d_{xy})$ represented
the proportion of total squared variability in the dependent that was
attributable to the independent, y. An analogous procedure may be
employed in the case of multiple correlation.


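Thus from (50) the proportion of the total squared variability
in the dependent attributable to the independents is given by $\sigma_{x'}^2 / \sigma_x^2$.
It becomes necessary to apportion the variability of $\sigma_{x'}^2$ to the different
variables which make up x', and thus attain our object. This may be
accomplished from a consideration of formula (49). Here $\sigma_{x'}^2$ is
defined as the sum of three terms, one each for each independent. The
proportion that each independent contributes to the squared variability of x'
may be determined then by taking the decimal fraction that each of the
three terms is of $\sigma_{x'}^2$. To relate these fractions to the dependent
it is necessary to multiply by the proportion $\sigma_{x'}^2 / \sigma_x^2$.

Thus

$$
\begin{aligned}
d_{xa.bc} &= \frac{\underline{b}_1 p_{ax}}{\sigma_{x'}^2} \cdot \frac{\sigma_{x'}^2}{\sigma_x^2} = \frac{\underline{b}_1 p_{ax}}{\sigma_x^2} \\
d_{xb.ac} &= \frac{\underline{b}_2 p_{bx}}{\sigma_{x'}^2} \cdot \frac{\sigma_{x'}^2}{\sigma_x^2} = \frac{\underline{b}_2 p_{bx}}{\sigma_x^2} \\
d_{xc.ba} &= \frac{\underline{b}_3 p_{cx}}{\sigma_{x'}^2} \cdot \frac{\sigma_{x'}^2}{\sigma_x^2} = \frac{\underline{b}_3 p_{cx}}{\sigma_x^2}
\end{aligned}
\tag{57}
$$
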


$d_{xa.bc}$ symbolizes the "net determination of x by a". Subscripts to
the right of the point designate the other independent variables included
in the study. The order in which these latter subscripts are listed is
immaterial. The first of the subscripts to the left of the point should
always designate the dependent, the second, the independent under
consideration. 5/

5/ For original treatment of coefficients of determination see the following:
Wright, S. Correlation and Causation. Jour. Agr. Research 20: 557-585.
1921. Smith, B. B. Forecasting the Acreage of Cotton. Amer. Statis. Assoc.
Jour. 20: 31-47. 1925.



Inspection of the formulae for coefficients of determination shows
that whenever the net regression of x on the given independent is of
opposite sign from the gross regression (which takes the sign of the
product moment) of x on the independent, the coefficient of net determination
is negative. This makes the interpretation of these coefficients
difficult. When such a condition arises it is usually due to relatively
high inter-correlation among independents, as compared with the correlation
between independents and dependent. It may be interpreted that the
influence of one variable upon the dependent is greater through a second
variable than directly. It is thus traceable back to the inappropriateness
of the implicit assumptions of the multiple regression equation.
There are, however, certain characteristics of determination coefficients
which give them definite meaning. The sum of them equals $R^2$, or the total
determination of the dependent by the several independents. Furthermore,
if two or more terms such as $\underline{b}_1 a$ and $\underline{b}_2 b$ be added together prior to
the computing of x', then the determination by the joined series may be
found by adding together the coefficients from the separate series; i.e.,

$$
d_{x(a+b).c} = d_{xa.bc} + d_{xb.ac} \tag{5?}
$$

This is a useful theorem, for in the event that negative determinations
appear, they may be added to the determinations by other closely related
independents; and the determination of that particular group of independents,
as a whole, may thus be ascertained.
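
The behavior of the coefficients of determination, including the appearance
of a negative value, may be seen in the Python sketch below. It is an
illustration under stated assumptions: the data are hypothetical and were
chosen so that the two independents are closely inter-correlated, only two
independents are used, and the names d_xa_b and d_xb_a are introduced here for
the two-independent analogues of the coefficients in formula (57).

```python
# Hypothetical observations; A and B are closely inter-correlated.
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
B = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
X = [0.5, 2.5, 2.0, 5.5, 3.5, 7.0]


def deviations(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def S(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def net_regressions(a, b, x):
    """Solve the two normal equations for b1 and b2 by Cramer's rule."""
    det = S(a, a) * S(b, b) - S(a, b) ** 2
    b1 = (S(a, x) * S(b, b) - S(b, x) * S(a, b)) / det
    b2 = (S(b, x) * S(a, a) - S(a, x) * S(a, b)) / det
    return b1, b2


a, b, x = deviations(A), deviations(B), deviations(X)
b1, b2 = net_regressions(a, b, x)
n = len(x)

var_x = S(x, x) / n
p_ax = S(a, x) / n
p_bx = S(b, x) / n

# Net determination of x by each independent (two-independent form of (57)).
d_xa_b = b1 * p_ax / var_x
d_xb_a = b2 * p_bx / var_x

# Total determination computed independently from the residuals.
residuals = [xi - b1 * ai - b2 * bi for ai, bi, xi in zip(a, b, x)]
R_squared = 1.0 - (S(residuals, residuals) / n) / var_x

print(d_xa_b, d_xb_a, d_xa_b + d_xb_a, R_squared)
```

Here the net regression on b is negative while the product moment of b with x
is positive, so the determination by b is negative; the two coefficients
nevertheless sum to the total determination.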

Another method of measuring the relative importance of a given
independent to the dependent may be developed by recalling to mind the dot graphs
described as a means of showing graphically the net relation of the various
independents to the dependent. Here the ordinates were in each case x minus
the functional contributions (such as $\underline{b}_2 b$ and $\underline{b}_3 c$) of all other
independents except the given one. Thus in one case the ordinates were
$(j = x - \underline{b}_2 b - \underline{b}_3 c)$ and the abscissae were a. The regression
line had a slope of $\underline{b}_1$. Taking this individual dot chart, we have
essentially a gross correlation problem, j being the dependent and a
the independent. Ordinary gross correlation methods may therefore be
used to measure the importance of a to j. Thus

$$
r_{aj} = \frac{\sigma_{j'}}{\sigma_j} \tag{58}
$$

This is but the relation of the standard deviation of estimates of j
represented by points on the regression line, compared to the standard
deviation of j itself.

But since each

$$
j' = \underline{b}_1 a \tag{59}
$$

Then

$$
\sigma_{j'} = \underline{b}_1 \sigma_a \tag{60}
$$

Hence

$$
r_{aj} = \frac{\underline{b}_1 \sigma_a}{\sigma_j} \tag{61}
$$

The value $\sigma_j$ may be determined as follows:

The basic formula, $\sigma_x^2 = \sigma_{x'}^2 + \sigma_z^2$, may be
paralleled and

$$
\sigma_j^2 = \underline{b}_1^2 \sigma_a^2 + \sigma_z^2 \tag{62}
$$

$\sigma_z$ is, of course, given by the formula

$$
\sigma_z^2 = \sigma_x^2 (1 - R^2_{x.abc}) \tag{63}
$$

Substituting (63) in (62)

$$
\sigma_j^2 = \underline{b}_1^2 \sigma_a^2 + \sigma_x^2 (1 - R^2_{x.abc}) \tag{64}
$$

Substituting (64) in (61)

$$
r_{aj}^2 = \frac{\underline{b}_1^2 \sigma_a^2}{\underline{b}_1^2 \sigma_a^2 + \sigma_x^2 (1 - R^2_{x.abc})}
         = \frac{1}{1 + \dfrac{\sigma_x^2 (1 - R^2_{x.abc})}{\underline{b}_1^2 \sigma_a^2}} \tag{65}
$$

This is the definition of an, as yet, little used measure, 6/ and may be
called the "coefficient of part correlation" as contrasted with "the
coefficient of partial correlation", which has a different meaning. It
is always positive in sign, varies between the limits of 0.0 and +1.0,
is quite easily computed, and perfectly general for systems of any number
of included independents. It is only necessary to insert the proper
$\underline{b}^2$ and $\sigma^2$ values in (65) to determine it, the other values in (65)
remaining constant for any given system.
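
Formula (65) may be compared with a direct computation of the squared
correlation between j and a, as in the following Python sketch. The sketch is
illustrative only: the observations are hypothetical, two independents are
assumed (so j is x less the contribution of b), and the names
part_squared_formula and part_squared_direct are introduced for the example.

```python
# Hypothetical observations of the dependent X and two independents A and B.
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
B = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
X = [0.5, 2.5, 2.0, 5.5, 3.5, 7.0]


def deviations(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def S(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def net_regressions(a, b, x):
    """Solve the two normal equations for b1 and b2 by Cramer's rule."""
    det = S(a, a) * S(b, b) - S(a, b) ** 2
    b1 = (S(a, x) * S(b, b) - S(b, x) * S(a, b)) / det
    b2 = (S(b, x) * S(a, a) - S(a, x) * S(a, b)) / det
    return b1, b2


a, b, x = deviations(A), deviations(B), deviations(X)
b1, b2 = net_regressions(a, b, x)
n = len(x)

residuals = [xi - b1 * ai - b2 * bi for ai, bi, xi in zip(a, b, x)]
var_x = S(x, x) / n
var_a = S(a, a) / n
R_squared = 1.0 - (S(residuals, residuals) / n) / var_x

# Squared part correlation from formula (65).
part_squared_formula = 1.0 / (1.0 + var_x * (1.0 - R_squared) / (b1 ** 2 * var_a))

# The same quantity found directly: correlate a with j = x - b2*b.
j = [xi - b2 * bi for xi, bi in zip(x, b)]
part_squared_direct = S(a, j) ** 2 / (S(a, a) * S(j, j))

print(part_squared_formula, part_squared_direct)
```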

A generalized notation may be developed, thus

$$
{}_{a}r^2_{x - bcde \ldots n} = \frac{1}{1 + \dfrac{\sigma_x^2 (1 - R^2_{x.abc \ldots n})}{\underline{b}^2_{xa.bcd \ldots n}\, \sigma_a^2}} \tag{66}
$$

The first subscript to the right of r designates the dependent, the
subscripts following the minus sign represent the independents whose
functions have been subtracted from x, and the subscript to the left of
r the remaining independent which is correlated with the dependent when
so treated.

Subscripts to b are similar in meaning to subscripts for d
already explained - the first one designating the dependent, the second
the given independent and subscripts to the right of the point the
remaining independents included.

${}_{a}r_{x - bcde \ldots n}$ is literally the plus or minus
coefficient of correlation between

$$
(x - \underline{b}_{xb.acd \ldots n}\, b - \underline{b}_{xc.abd \ldots n}\, c - \ldots - \underline{b}_{xn.abc \ldots (n-1)}\, n)
$$

and a.

6/ This measure was worked out together by Mordecai Ezekiel and B. B.
Smith and is here published for the first time.

Measures of "determination" and of "part correlation" have been
developed in the above as means of measuring the importance of the
various independent variables to the dependent. A third measure which
is classical in its use, but much more laborious to compute, will now be
described. This is the coefficient of "partial" or "net" correlation. 7/

If, as before, we take the case of four variables (deviations from
means) x, a, b, c and determine the partial correlation of x and a,
expressed by $r_{xa.bc}$, this could be done as follows:

(1) Find the values of b in the following:

$$
x = \underline{b}_{xb.c}\, b + \underline{b}_{xc.b}\, c
$$

and in

$$
a = \underline{b}_{ab.c}\, b + \underline{b}_{ac.b}\, c
$$

(2) determine the two sets of residuals

$$
\begin{aligned}
x - (\underline{b}_{xb.c}\, b + \underline{b}_{xc.b}\, c) &= z_1 \\
a - (\underline{b}_{ab.c}\, b + \underline{b}_{ac.b}\, c) &= z_2
\end{aligned}
$$

(3) and find

$$
r_{z_1 z_2} = r_{xa.bc}
$$

The value of $r_{xa.bc}$ may be determined from the following
considerations:

$z_2$ is uncorrelated with b or c, i.e.,

$$
p_{z_2 b} = 0, \qquad p_{z_2 c} = 0 \tag{67}
$$

7/ Yule, G. U. An Introduction to the Theory of Statistics. 6th ed. enl.
London, 1923, chap. XII.



This is because, by a previously demonstrated theorem, residuals are
uncorrelated with independents, the product moments being zero. (See formula
(44) et seq.)

In a similar manner

$$
p_{z_1 b} = 0, \qquad p_{z_1 c} = 0 \tag{68}
$$

If now $z_1$ and $z_2$ be correlated and $z_1$ be estimated from $z_2$ by the
resulting regression equation, these estimates may be subtracted from
$z_1$ giving $z_3$, or the error of estimating $z_1$ from $z_2$, i.e.,

$$
z_3 = z_1 - \underline{b}\, z_2 \tag{69}
$$

wherein $\underline{b}$ is the gross regression of $z_1$ on $z_2$.

Then by application of previously developed theory,

$$
r^2_{z_1 z_2} = 1 - \frac{\sigma^2_{z_3}}{\sigma^2_{z_1}} \tag{70}
$$

It remains to determine the values of $\sigma_{z_3}$ and $\sigma_{z_1}$.

Now it may be shown that both b and c are uncorrelated with
$z_3$ for, taking the variable, c, as a case,
$p_{cz_3}$ may be shown to equal zero:

Thus

$$
\begin{aligned}
p_{cz_3} &= \frac{1}{n}\, S(c\, z_3) \\
         &= \frac{1}{n}\, S\!\left[ c\,(z_1 - \underline{b}\, z_2) \right] \\
         &= p_{cz_1} - \underline{b}\, p_{cz_2}
\end{aligned}
\tag{71}
$$


But by (67) and (68) both $p_{cz_1}$ and $p_{cz_2}$ are zero and hence

$$
p_{cz_3} = 0 \tag{72}
$$

Similarly, b is also uncorrelated with $z_3$.

Not only is $z_3$ uncorrelated with b and c but it is also
obviously uncorrelated with $z_2$, being the residual from the gross
regression of $z_1$ on $z_2$. It is therefore also uncorrelated with
a, for

$$
p_{z_3 z_2} = 0 = p_{az_3} - \underline{b}_{ab.c}\, p_{bz_3} - \underline{b}_{ac.b}\, p_{cz_3}
$$

But since $p_{bz_3}$ and $p_{cz_3}$ are zero, it follows that

$$
p_{az_3} = p_{z_3 z_2} = 0 \tag{73}
$$

In short, $z_3$, or the errors of estimating $z_1$ from $z_2$, are uncorrelated
with all the variables in the system, with the exception of x. It follows,
therefore, by the theory developed in connection with formula (44), that
the values of $z_3$ are those that would be secured if estimates of x
were secured from the multiple regression equation,

$$
x' = \underline{b}_{xa.bc}\, a + \underline{b}_{xb.ac}\, b + \underline{b}_{xc.ab}\, c
$$

and subtracted from x, i.e.,

$$
z_3 = x - \underline{b}_{xa.bc}\, a - \underline{b}_{xb.ac}\, b - \underline{b}_{xc.ab}\, c
$$

and thus

$$
\sigma^2_{z_3} = \sigma_x^2 (1 - R^2_{x.abc}) \tag{74}
$$

The value $\sigma^2_{z_1}$ may be written, of course,

$$
\sigma^2_{z_1} = \sigma_x^2 (1 - R^2_{x.bc}) \tag{75}
$$



The equivalents given in (74) and (75) may be substituted
in (70) to give, then, a formula which may be utilized in the computation
of partial coefficients of correlation:

$$
r^2_{xa.bc} = 1 - \frac{\sigma_x^2 (1 - R^2_{x.abc})}{\sigma_x^2 (1 - R^2_{x.bc})} \tag{76}
$$

In practice, of course, the $\sigma_x^2$ values cancel. This formula represents
a new concept of the coefficient of partial correlation - it is a function
of the ratio of the errors of estimating x from all the independents
to the errors of estimating x from all the independents except the
given one. Or, again, it is a function of the amount that the error of
estimating x is reduced by including the given independent in the
estimating. This last is a convenient concept of the partial correlation
coefficient. But its literal meaning should not be forgotten - it is
the correlation between any two variables in a system after the effect
of all other variables in the system has been eliminated from both of
them, by least square correlation methods.
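
The two concepts of the partial coefficient, the literal one (correlation of
the residuals z1 and z2) and the ratio-of-errors one of formula (76), may be
compared in the Python sketch below. It is a simplified illustration: the data
are hypothetical, and only one variable, b, is eliminated instead of the two
(b and c) of the text, so that the coefficient computed is the analogue
r(xa.b) and the multiple coefficient of x on b alone reduces to the gross
correlation of x and b. All names are introduced for the example.

```python
# Hypothetical observations of the dependent X and two independents A and B.
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
B = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
X = [0.5, 2.5, 2.0, 5.5, 3.5, 7.0]


def deviations(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def S(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def net_regressions(a, b, x):
    """Solve the two normal equations for b1 and b2 by Cramer's rule."""
    det = S(a, a) * S(b, b) - S(a, b) ** 2
    b1 = (S(a, x) * S(b, b) - S(b, x) * S(a, b)) / det
    b2 = (S(b, x) * S(a, a) - S(a, x) * S(a, b)) / det
    return b1, b2


def gross_residuals(dependent, independent):
    """Residuals from the gross regression of one series on another."""
    slope = S(dependent, independent) / S(independent, independent)
    return [d - slope * i for d, i in zip(dependent, independent)]


a, b, x = deviations(A), deviations(B), deviations(X)
b1, b2 = net_regressions(a, b, x)

# Literal meaning: eliminate b from both x and a, then correlate the residuals.
z1 = gross_residuals(x, b)
z2 = gross_residuals(a, b)
partial_squared_direct = S(z1, z2) ** 2 / (S(z1, z1) * S(z2, z2))

# Ratio-of-errors form, the analogue of formula (76).
z = [xi - b1 * ai - b2 * bi for ai, bi, xi in zip(a, b, x)]
R_squared_all = 1.0 - S(z, z) / S(x, x)          # x on a and b
R_squared_without_a = 1.0 - S(z1, z1) / S(x, x)  # x on b only
partial_squared_formula = 1.0 - (1.0 - R_squared_all) / (1.0 - R_squared_without_a)

print(partial_squared_direct, partial_squared_formula, R_squared_all)
```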

The formulae for the partial correlations of the other variables,
b and c, are similar to (76):

$$
\begin{aligned}
r^2_{xb.ac} &= 1 - \frac{\sigma_x^2 (1 - R^2_{x.abc})}{\sigma_x^2 (1 - R^2_{x.ac})} \\
r^2_{xc.ab} &= 1 - \frac{\sigma_x^2 (1 - R^2_{x.abc})}{\sigma_x^2 (1 - R^2_{x.ab})}
\end{aligned}
\tag{76-a}
$$

Note that in these equations the numerator of the fraction, representing
the squared standard error ($\sigma^2_{z_3}$) of estimating the x term, ($z_1$),
from the other, ($z_2$), does not change for the various partials. The
difference in the partial correlation for the different independent
variables is thus attributable to changing values in the denominator of
the fraction, which represents the variability remaining in x, ($\sigma^2_{z_1}$), after
elimination of the influence of variables indicated by subscripts to R
to the right of the point.

The greatest value of a partial will occur when the denominator
term becomes largest. The limit is evidently when $R^2_{x.bc}$ is zero.
In this event the partial coefficient becomes equal to the multiple
coefficient. This is a useful theorem. Partial coefficients can never
be greater than the multiple, or conversely, the multiple coefficient is
always as large and usually larger than the largest partial coefficient.

Partial coefficients of correlation as computed by formula (76)
may be taken as either the plus or the minus root of the squared value.
It is customary to give the coefficient the sign of the net regression
coefficient which has subscripts identical to it.

Before leaving the subject of partial correlation the generalized
formula may be given.

$$
r^2_{xn.abcd \ldots (n-1)} = 1 - \frac{1 - R^2_{x.abc \ldots n}}{1 - R^2_{x.abc \ldots (n-1)}} \tag{77}
$$

Although the development of the theory of partial correlation
coefficients has been given with particular reference to four variables only,
it will be recognized that this development is so presented as to be
perfectly general in its application to any number of variables. A limited
number of variables were employed in order to avoid confusion resulting
from too profuse subscript notation otherwise necessitated.



(4)    Arithmetic  methods. 

A complete and very detailed description of the arithmetic methods
of working out multiple correlation solutions, whether they be large scale or
small scale in scope, has already been prepared 8/, and it does not seem
advisable to burden this discussion with the bulk of that description.
Only the more general points will here be discussed.

There are two laborious tasks in determining net regression
coefficients and multiple correlation coefficients. The first is to compute the
necessary product moments and standard deviations to make up the normal
equations, the second to solve these equations.

Since the preparation of the various product moments and standard
deviations necessary to the forming of the normal equations involves the
inter-multiplying of all possible pairs of variables and the squaring of
all variables, it is eminently practical to reduce these variables to as
simple arithmetic values as possible prior to their multiplication. This
involves only the coding of these variables as previously explained in
connection with gross correlation. A systematic notation of all coding
processes should be made, however, in order to avoid confusion at a later
time when it comes to decoding the results. Thus if c represents the
code of any value, C, the value of c in terms of C should be noted.
This relationship can always be expressed in the form of a linear equation,
i.e., c = k_1 + k_2 C. When it comes time to write the multiple regression
equation in terms of original values, it is only necessary to substitute
for c its equivalent, k_1 + k_2 C, and simplify the equation, involving
only ordinary algebraic processes.

8/ Smith, B. B. Use of Punched Card Tabulating Equipment in Multiple
Correlation Problems. U. S. Dept. Agr., Bur. Agr. Econ., 1923. 24 pp.
Mimeographed.

The process of multiplying together and squaring the variables,
or "making the extensions", is simplified if the coded values are used
directly, rather than the deviations from means. This requires, of
course, the use of formulae such as (29) and (31) in the computation
of the product moments and standard deviations. But each investigator
will quickly work out for himself methods of systematizing these processes,
or he may find them already prepared for him in the aforementioned
publication.

It should be noted that once the variables are coded, the fact that
they have been coded may be forgotten in all phases of the interpretation
of results, save only in the case of regression coefficients; correlation
coefficients of all kinds, and coefficients of determination, will be
identical with those that would have been secured were original, uncoded,
values employed. Only the regression values are changed by the use of
the codes.
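The point may be verified numerically. The following is a minimal sketch in Python, not part of the original manuscript. The series A and X are the values as read from Table 9, and the code constants k1 and k2 are hypothetical, chosen only for illustration.

```python
# Series A and X as read from Table 9 (illustrative values).
A = [11, 20, 6, 6, 8, 9, 11, 14, 12, 8, 4, 23, 14, 10, 10, 20, 12, 10, 16, 20]
X = [14, 24, 4, 8, 16, 12, 13, 18, 9, 11, 11, 28, 17, 14, 15, 26, 16, 21, 19, 27]

def product_moment(u, v):
    """Sum of products of deviations from the two means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v))

def correlation(u, v):
    return product_moment(u, v) / (product_moment(u, u) * product_moment(v, v)) ** 0.5

def regression(dep, ind):
    """Gross regression coefficient of dep on ind."""
    return product_moment(dep, ind) / product_moment(ind, ind)

# Hypothetical linear code: a = k1 + k2 * A (assumed constants, not from the text).
k1, k2 = -5.0, 0.5
a = [k1 + k2 * v for v in A]

r_original, r_coded = correlation(X, A), correlation(X, a)
b_original, b_coded = regression(X, A), regression(X, a)

# Decoding: substitute a = k1 + k2*A in the coded equation and collect terms.
b_decoded = b_coded * k2

print(round(r_original, 4), round(r_coded, 4))
print(round(b_original, 4), round(b_coded, 4), round(b_decoded, 4))
```

The correlation coefficient is the same for coded and uncoded values, while the regression coefficient must be multiplied by k2 to return to original units.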

After the variables have been coded, and prior to any extension
of them, it is advisable to introduce a "check-sum". This check-sum is
merely the sum of all associated (coded) values of independents and
dependent. It serves as a means of checking all extensions and also carries
on through the solution of the normals as an almost complete check on all
arithmetical work. The operation of this check-sum in checking extensions
may be explained as follows:

Suppose variables A, B, C, and X and a check-sum, U = A + B + C + X;
then it obviously follows that

    S(A^2) + S(AB) + S(AC) + S(AX) = S[A(A + B + C + X)]
                                   = S(AU)

The computation of S(AU) serves to check the computation of S(A^2), S(AB),
S(AC) and S(AX).
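The check-sum identity can be illustrated with a short Python sketch (an addition for illustration, not part of the original manuscript). The four series are the values as read from Table 9; the helper name S is chosen here to mirror the summation sign of the text.

```python
# Series as read from Table 9 (illustrative values).
A = [11, 20, 6, 6, 8, 9, 11, 14, 12, 8, 4, 23, 14, 10, 10, 20, 12, 10, 16, 20]
B = [10, 19, 6, 12, 8, 8, 8, 16, 10, 8, 5, 26, 12, 16, 10, 13, 12, 2, 6, 20]
C = [9, 15, 0, 6, 26, 8, 8, 16, 0, 8, 10, 26, 10, 14, 15, 20, 12, 8, 5, 30]
X = [14, 24, 4, 8, 16, 12, 13, 18, 9, 11, 11, 28, 17, 14, 15, 26, 16, 21, 19, 27]

# Check-sum: the sum of all associated values for each item.
U = [a + b + c + x for a, b, c, x in zip(A, B, C, X)]

def S(u, v):
    """Sum of products of associated values, the S(...) of the text."""
    return sum(p * q for p, q in zip(u, v))

extensions = S(A, A) + S(A, B) + S(A, C) + S(A, X)
check = S(A, U)
print(extensions, check)  # the two totals must agree
```

If any one of the four extensions were miscomputed, the two totals would disagree, which is what makes the extra column worth carrying.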

After the normal equations have been prepared the method of
solution advocated is the Doolittle Method 9/. The arithmetic processes
of this method as applied specifically to multiple correlation problems
may be found in detail in an aforementioned publication of the Bureau of
Agricultural Economics.

It may be remembered, however, that any standard method of solving
simultaneous equations is valid for the determination of the values of b.

(5)    Use  of  Multiple  Correlation  Methods  for 
the  Fitting  of  Parabolae. 

If it be assumed that the relation of Y to X, X being the
dependent, is of the nature of a parabola rather than of a straight line,
instead of using approximation methods discussed in considering the
correlation index to find this curvilinear relationship, multiple correlation
methods may be used.

The assumption is, of course,

    X = K + b_1 Y + b_2 Y^2 + ... + b_n Y^n  ..........(78)

All that is necessary to do is to substitute the appropriate powers of
Y for the independent variables in the usual multiple regression equation:

    x = b_1 a + b_2 b + ... + b_n n

and solve for the values of b. Writing the multiple regression equation
in terms of original values, rather than deviations from average, supplies
the value of K in (78).
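As an illustration of the substitution, the following Python sketch (an addition, not part of the original manuscript) fits a second-degree parabola by treating Y and Y^2 as two independents and solving the two normal equations directly. The function name fit_parabola and the five-point data set are hypothetical, made up so that the answer is known in advance.

```python
def fit_parabola(ys, xs):
    """Return K, b1, b2 in X = K + b1*Y + b2*Y**2 by least squares.

    Y and Y**2 are treated as two independent variables; the normal
    equations in deviations from means are solved by determinants.
    """
    n = len(ys)
    sq = [y * y for y in ys]
    my, msq, mx = sum(ys) / n, sum(sq) / n, sum(xs) / n
    dy = [v - my for v in ys]
    dsq = [v - msq for v in sq]
    dx = [v - mx for v in xs]
    s11 = sum(p * p for p in dy)
    s12 = sum(p * q for p, q in zip(dy, dsq))
    s22 = sum(p * p for p in dsq)
    s1x = sum(p * q for p, q in zip(dy, dx))
    s2x = sum(p * q for p, q in zip(dsq, dx))
    det = s11 * s22 - s12 * s12
    b1 = (s1x * s22 - s12 * s2x) / det
    b2 = (s11 * s2x - s12 * s1x) / det
    # Writing the equation in original values supplies the constant K.
    k = mx - b1 * my - b2 * msq
    return k, b1, b2

# Hypothetical data lying exactly on X = 2 + 3Y - 0.5Y^2.
Y = [0, 1, 2, 3, 4]
X = [2.0, 4.5, 6.0, 6.5, 6.0]
K, b1, b2 = fit_parabola(Y, X)
print(round(K, 6), round(b1, 6), round(b2, 6))
```

For higher powers the same idea applies, with one additional independent and one additional normal equation per power.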

9/ The theory and method of this solution may be found in the following:
Adams, O. S. Geodesy - Application of the Theory of Least Squares to
the Adjustment of Triangulation. 1915. U.S. Coast and Geodetic
Survey. Spec. Pub. 28.

Wright, T. W., and Hayford, J. F. Adjustment of Observations by Method
of Least Squares with Applications to Geodetic Work. 2d ed. N.Y., 1906.

Since there is a constant, mathematical relationship between Y
and its powers, only one curve on a coordinate graph is necessary to
describe the relationship, rather than one curve for each variable as
in the usual multiple correlation. This curve can, of course, be
determined by assuming values of Y in (78) and evaluating for X, which
then gives as many points on the curve as one cares to compute. The
curve may then be drawn to pass through the points.

If in the process of the analysis of the relation of X to several
independent variables, it is assumed that the net relation to one of
them is best described by a parabola, rather than a straight line, the
necessary powers of the given independent variable may be introduced as
new independents and the multiple correlation proceed as usual. The
coefficient of multiple correlation may be computed by the usual process.
The computation of partial correlation coefficients, wherein the several
powers of the given independent are treated jointly, is too complicated
to be practical. The coefficient of determination, however, may be
secured by simply adding together the several coefficients computed for
the various powers of the given independent. The coefficient of part
correlation may be computed by substituting for the term,

    b^2_{xa.bcd...n} sigma^2_a

in formula (66) a term which is equivalent to the squared standard
deviation of contributions from the various powers of the given variable
added together. Suppose that the variable b represented variable
a, squared. The term would then be

    (1/n) S(b_a a + b_b b)^2

where b_a and b_b are the net regression coefficients of a and of b.
This value would then be substituted for b^2_{xa.bcd...n} sigma^2_a in
formula (65). Its computation involves only the net regression
coefficients and standard deviations and product moment already computed.

The multiple correlation method is, of course, adaptable to the
fitting of other types of curves susceptible to determination by methods
of least squares.

VI. Multiple Curvilinear Correlation

Linear multiple correlation assumed that the functions in the
following equation were linear and thus provided weights or regression
coefficients defining the slope of the straight lines: 10/

    X = F_1(a) + F_2(b) + F_3(c) + ... + F_n(n)  ..........(79)

Curvilinear multiple correlation makes no assumptions as to the
nature of the functions save that they may be represented by a smooth
curve. It permits the data, of themselves, to reveal the nature of the
functions. The functions are found by first assuming certain curves
to be descriptive of the functions and then by methods of simultaneous
approximation these assumed curves are adjusted and modified so as to
give minimum squared residuals when applied to the independents in
estimating the dependent.

It is apparent that formula (79) represents a much broader case
of the qualitative assumption,

    x = f(a, b, c ... n),

than does the multiple linear regression equation.

It is limited, however, in that it assumes that an adding together of
the several functions gives a best representation of x, and in this
respect may suffer from the same inappropriateness of assumption as in
the case of multiple linear correlation.

10/ For original presentation of curvilinear correlation see Ezekiel,
Mordecai. A Method of Handling Curvilinear Correlation for Any Number
of Variables. Amer. Statis. Assoc. Jour. 19: 431 - 453, 1924.

To visualize the distinction between linear and curvilinear
regression curves it is only necessary to recall the graphic method of
representing net regression lines discussed in the initial description
of multiple net regression. Here, the values of x, corrected for
contributions of all variables except the given independent, a, were
the ordinates on a coordinate dot chart. The abscissae were the values
of the given independent, a. Suppose, now, that the distribution of
the dots on this chart was such that it were apparent that a better fit
to the dots could be had by constructing a curve, rather than a straight
line. A free-hand curve may accordingly be substituted for the net
regression line. And in similar manner curves may be substituted for
the net regression lines in the dot charts representing the net relation
of x to the other independent variables. These curves then describe
the functional relation of each independent to the dependent.

Since the first step in the process of determining curvilinear
net regression is the construction of these net dot charts, an easy
method of preparing these dot charts may be described. Using for
illustrative purposes the four variables, x, a, b, c, which are deviations
from averages of items in the four series, X, A, B, and C, which are shown
in Table 9, the first step is to determine the values of b by ordinary
multiple correlation methods in the following:

    x = b_1 a + b_2 b + b_3 c  ..........(80)

x, of course, being the dependent.

For the sake of illustration an arithmetic example is given.


Table 9.- Data for illustration 11/

Item     :    A    :    B    :    C    :    X    :   Sum
number   :         :         :         :         :
    1    :   11    :   10    :    9    :   14    :    44
    2    :   20    :   19    :   15    :   24    :    78
    3    :    6    :    6    :    0    :    4    :    16
    4    :    6    :   12    :    6    :    8    :    32
    5    :    8    :    8    :   26    :   16    :    58
    6    :    9    :    8    :    8    :   12    :    37
    7    :   11    :    8    :    8    :   13    :    40
    8    :   14    :   16    :   16    :   18    :    64
    9    :   12    :   10    :    0    :    9    :    31
   10    :    8    :    8    :    8    :   11    :    35
   11    :    4    :    5    :   10    :   11    :    30
   12    :   23    :   26    :   26    :   28    :   103
   13    :   14    :   12    :   10    :   17    :    53
   14    :   10    :   16    :   14    :   14    :    54
   15    :   10    :   10    :   15    :   15    :    50
   16    :   20    :   13    :   20    :   26    :    79
   17    :   12    :   12    :   12    :   16    :    52
   18    :   10    :    2    :    8    :   21    :    41
   19    :   16    :    6    :    5    :   19    :    46
   20    :   20    :   20    :   30    :   27    :    97

Sums     :  244    :  227    :  246    :  323    : 1,040
Means    :   12.20 :   11.35 :   12.30 :   16.15 :    52.00

Extensions - Preparing Normal Equations 12/

         :     A     :     B     :     C     :     X     :  Check sum
a - 1    :  3,504.0  :  3,196.0  :  3,463.0  :  4,515.0  :  14,678.0
a - 2    :  2,976.8  :  2,769.4  :  3,001.2  :  3,940.6  :  12,688.0
a - 3    :    527.2  :    426.6  :    461.8  :    574.4  :   1,990.0

b - 1    :           :  3,207.0  :  3,373.0  :  4,097.0  :  13,873.0
b - 2    :           :  2,576.5  :  2,792.1  :  3,666.0  :  11,804.0
b - 3    :           :    630.5  :    580.9  :    431.0  :   2,069.0

c - 1    :           :           :  4,296.0  :  4,740.0  :  15,872.0
c - 2    :           :           :  3,025.8  :  3,972.9  :  12,792.0
c - 3    :           :           :  1,270.2  :    767.1  :   3,080.0

x - 1    :           :           :           :  6,025.0  :  19,377.0
x - 2    :           :           :           :  5,216.5  :  16,796.0
x - 3    :           :           :           :    808.5  :   2,581.0

11/ Taken from table in Mordecai Ezekiel's article cited
previously. (See footnote 3).

12/ After the manner shown in Use of Punched Card Equipment
cited previously. (See footnote 8).


Table 10.- Normal Equations and Solution 13/

Equation :          Terms in              : Absolute :  Check
         :   b_1    :   b_2    :   b_3    :   term   :   sum
I.       :   527.2  :   426.6  :   461.8  :   574.4  :  1,990.0
II.      :          :   630.5  :   580.9  :   431.0  :  2,069.0
III.     :          :          : 1,270.2  :   767.1  :  3,080.0

Solution
I.       :   527.2  :   426.6  :   461.8  :   574.4  :  1,990.0
         : -1.0000  :  -.8092  :  -.8759  : -1.0895  :  -3.7746

II.      :          :   630.5  :   580.9  :   431.0  :  2,069.0
         :          :  -345.2  :  -373.7  :  -464.8  : -1,610.3
         :          :   285.3  :   207.2  :   -33.8  :    458.7
         :          : -1.0000  :  -.7263  :  +.1185  :  -1.6078

III.     :          :          : 1,270.2  :   767.1  :  3,080.0
         :          :          :  -404.5  :  -503.1  : -1,743.0
         :          :          :  -150.5  :   +24.6  :   -333.2
         :          :          :   715.2  :   288.6  :  1,003.8

    b_3  =  288.6 / 715.2                =    .4035
    b_2  =  -.1185 - .2931               =   -.4116
    b_1  =  1.0895 + .3331 - .3534       =   1.0692

Determination
         : Product moment : b times product : Coefficient of
         :    with x      :     moment      : determination (d)
a        :     574.4      :      614.1      :      .7596
b        :     431.0      :     -177.4      :     -.2194
c        :     767.1      :      309.5      :      .3828

    S(d)  =  .9230
    R     =  .96

Proof of b:

III.         461.8 b_1  +  580.9 b_2  +  1,270.2 b_3  =  767.1
Coeff. x b     493.7    -    239.1    +     512.5     =  767.1

13/ After the manner described in Use of Punched Card Equipment cited
previously. (See footnote 8).
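Since any standard method of solving simultaneous equations is valid, the normal equations of the example may also be solved by plain elimination. The following Python sketch is an addition for illustration and is not the Doolittle layout of the manuscript. The product moments are the figures as read from Table 10, and the function name solve is hypothetical.

```python
# Normal equations for b1, b2, b3 (figures as read from Table 10; illustrative).
coefficients = [
    [527.2, 426.6, 461.8],
    [426.6, 630.5, 580.9],
    [461.8, 580.9, 1270.2],
]
absolute_terms = [574.4, 431.0, 767.1]
s_xx = 808.5  # sum of squared deviations of X

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting, then back substitution."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(aug[r][i]))
        aug[i], aug[pivot] = aug[pivot], aug[i]
        for r in range(i + 1, n):
            factor = aug[r][i] / aug[i][i]
            for col in range(i, n + 1):
                aug[r][col] -= factor * aug[i][col]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][col] * solution[col] for col in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution

b = solve(coefficients, absolute_terms)

# Coefficients of determination and the multiple correlation coefficient.
d = [bi * pi / s_xx for bi, pi in zip(b, absolute_terms)]
R = sum(d) ** 0.5
print([round(v, 4) for v in b], round(R, 2))
```

Small differences in the fourth decimal place from the tabled values are to be expected, since the table carries rounded intermediate figures.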


Table 11.- Tabulation of residuals, with A, B, and C

Item     :    A    :    B    :    C    :   z_1   :  z_2 1/
    1    :   11    :   10    :    9    :  -0.1   :  +0.1
    2    :   20    :   19    :   15    :  +1.7   :  +0.1
    3    :    6    :    6    :    0    :  -2.7   :  +0.1
    4    :    6    :   12    :    6    :  +1.6   :  +2.5
    5    :    8    :    8    :   26    :  -2.8   :  -1.9
    6    :    9    :    8    :    8    :  -0.4   :  +0.1
    7    :   11    :    8    :    8    :  -1.6   :  -1.4
    8    :   14    :   16    :   16    :  +0.5   :  +0.1
    9    :   12    :   10    :    0    :  -2.4   :  +0.1
   10    :    8    :    8    :    8    :  -0.3   :  +0.1
   11    :    4    :    5    :   10    :  +1.9   :  +2.1
   12    :   23    :   26    :   26    :  +1.1   :  -0.4
   13    :   14    :   12    :   10    :  +0.1   :  +0.1
   14    :   10    :   16    :   14    :  +1.8   :  +1.1
   15    :   10    :   10    :   15    :  -0.5   :  -0.9
   16    :   20    :   13    :   20    :  -1.2   :  -2.4
   17    :   12    :   12    :   12    :  +0.5   :  +0.1
   18    :   10    :    2    :    8    :  +4.7   :  +2.6
   19    :   16    :    6    :    5    :  -0.8   :  -1.4
   20    :   20    :   20    :   30    :  -1.1   :  -1.9

Sums         :  244    :  227    :  246    :   0     :  -1.1
Means        :   12.2  :   11.35 :   12.3  :   0     :
Regressions  :  1.1054 :  -.4720 :   .4179 :         :

    S(z_1^2)                   =  63.36
    S(z_1^2) / n sigma_x^2     =  .0758
    R_{x.abc} = sqrt(1 - .0758) =  0.96

1/ Brought from table 12 for use in obtaining second
approximation curves.


Table 12.- Readings of functional relations of x to independents
from first approximation curves (in Figure 2)

Observation :  F(A)  :  F(B)  :  F(C)  :  S(F)  : S(F) + K :   X   : z_2 =  : z_2^2
number      :        :        :        :        :  = X'    :       : X - X' :
    1       :  -1.5  :  +0.5  :  -1.0  :  -2.0  :   13.9   :   14  :  +0.1  :   .01
    2       :  +9.5  :  -3.5  :  +2.0  :  +8.0  :   23.9   :   24  :  +0.1  :   .01
    3       :  -7.5  :  +3.0  :  -7.5  : -12.0  :    3.9   :    4  :  +0.1  :   .01
    4       :  -7.5  :  -0.5  :  -2.5  : -10.5  :    5.5   :    8  :  +2.5  :  6.25
    5       :  -5.0  :  +1.5  :  +5.5  :  +2.0  :   17.9   :   16  :  -1.9  :  3.61
    6       :  -4.0  :  +1.5  :  -1.5  :  -4.0  :   11.9   :   12  :  +0.1  :   .01
    7       :  -1.5  :  +1.5  :  -1.5  :  -1.5  :   14.4   :   13  :  -1.4  :  1.96
    8       :  +2.0  :  -2.0  :  +2.0  :  +2.0  :   17.9   :   18  :  +0.1  :   .01
    9       :   0.0  :  +0.5  :  -7.5  :  -7.0  :    8.9   :    9  :  +0.1  :   .01
   10       :  -5.0  :  +1.5  :  -1.5  :  -5.0  :   10.9   :   11  :  +0.1  :   .01
   11       : -10.0  :  +3.5  :  -0.5  :  -7.0  :    8.9   :   11  :  +2.1  :  4.41
   12       : +13.0  :  -6.0  :  +5.5  : +12.5  :   28.4   :   28  :  -0.4  :   .16
   13       :  +2.0  :  -0.5  :  -0.5  :  +1.0  :   16.9   :   17  :  +0.1  :   .01
   14       :  -2.5  :  -2.0  :  +1.5  :  -3.0  :   12.9   :   14  :  +1.1  :  1.21
   15       :  -2.5  :  +0.5  :  +2.0  :   0.0  :   15.9   :   15  :  -0.9  :   .81
   16       :  +9.5  :  -1.0  :  +4.0  : +12.5  :   28.4   :   26  :  -2.4  :  5.76
   17       :   0.0  :  -0.5  :  +0.5  :   0.0  :   15.9   :   16  :  +0.1  :   .01
   18       :  -2.5  :  +6.5  :  -1.5  :  +2.5  :   18.4   :   21  :  +2.6  :  6.76
   19       :  +4.5  :  +3.0  :  -3.0  :  +4.5  :   20.4   :   19  :  -1.4  :  1.96
   20       :  +9.5  :  -3.5  :  +7.0  : +13.0  :   28.9   :   27  :  -1.9  :  3.61

Sums        :  +0.5  :  +4.0  :  +1.5  :  +6.0  :  324.1   :  323  :  -1.1  : 36.59

    P_{x.ABC}  =  sqrt( 1 - S(z_2^2) / S(x^2) )
               =  sqrt( 1 - 36.59 / 808.5 )
               =  .977

    K  =  M_x - (1/n) S[S(F)]
       =  16.15 - 6.0 / 20
       =  15.9


Figure 2.- Net relation of x to A, B, and C, for first approximation.
[Three dot charts, one for each independent. Ordinate: effect on x.
Abscissa: values of independents.]

Figure 3.- Net relation of x to A, B, and C, for second approximation.
[Three dot charts: net relation of x to A, net relation of x to B, and
net relation of x to C, each showing the first approximation curve and
the second approximation curve. Ordinate: effect on x. Abscissa: values
of independents.]

The values of b having been found by multiple linear correlation methods,
the dot charts may be constructed on the basis of the following consideration.
Since the ordinates for the chart showing the relation of a to x are the
values of x less the functional terms b_2 b and b_3 c, i.e.,

    (j_1 = x - b_2 b - b_3 c),

the ordinates may also be defined as the residuals plus the contributions of
a, b_1 a, i.e.,

    j_1 = b_1 a + z  ..........(81)

For z = x - b_1 a - b_2 b - b_3 c,

    z + b_1 a = x - b_2 b - b_3 c

              = j_1, by definition.

Thus, to construct the graph it is only necessary to

(1) Find the values of z

(2) Graph the regression of x on a (= b_1)

(3) Plot the residuals as ordinate deviations from the regression
line with abscissae the associated values of a.

The ordinate value of the regression line takes care of the term
b_1 a in (81) for any given value of a, and it is therefore only
necessary to add the value of z to locate the point.

In a similar manner, to make the dot chart showing the net relation
of b to x it is only necessary to plot the residuals as ordinate
deviations from the regression of x on b with abscissae the associated
values of b. And likewise for c.
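The equivalence on which the dot charts rest can be checked with a few lines of Python (an illustrative addition, not part of the original manuscript). The series are the values as read from Table 9 and the coefficients are those as read from Table 10; the residuals printed in Table 11 may differ slightly, since that table lists somewhat different regressions.

```python
# Series as read from Table 9 (illustrative values).
A = [11, 20, 6, 6, 8, 9, 11, 14, 12, 8, 4, 23, 14, 10, 10, 20, 12, 10, 16, 20]
B = [10, 19, 6, 12, 8, 8, 8, 16, 10, 8, 5, 26, 12, 16, 10, 13, 12, 2, 6, 20]
C = [9, 15, 0, 6, 26, 8, 8, 16, 0, 8, 10, 26, 10, 14, 15, 20, 12, 8, 5, 30]
X = [14, 24, 4, 8, 16, 12, 13, 18, 9, 11, 11, 28, 17, 14, 15, 26, 16, 21, 19, 27]

# Net regression coefficients as read from Table 10.
b1, b2, b3 = 1.0692, -0.4116, 0.4035

def deviations(v):
    """Deviations of a series from its own average."""
    m = sum(v) / len(v)
    return [p - m for p in v]

a, b, c, x = (deviations(v) for v in (A, B, C, X))

# Residuals: z = x - b1*a - b2*b - b3*c.
z = [xi - b1 * ai - b2 * bi - b3 * ci for xi, ai, bi, ci in zip(x, a, b, c)]

# Ordinates for the chart of x against a, by formula (81): j1 = b1*a + z.
j1 = [b1 * ai + zi for ai, zi in zip(a, z)]

# The same ordinates computed the long way: x less the terms for b and c.
j1_direct = [xi - b2 * bi - b3 * ci for xi, bi, ci in zip(x, b, c)]

print(round(z[0], 2), round(j1[0], 2), round(j1_direct[0], 2))
```

Once z is in hand, each further chart costs only one multiplication and one addition per item.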


The advantage of this method of constructing the dot charts is
that it saves labor. If j were determined repeatedly for each case,
the three following series would have to be secured:

    j_1 = x - b_2 b - b_3 c

    j_2 = x - b_1 a - b_3 c

    j_3 = x - b_1 a - b_2 b

which repeats three times the processes involved in securing

    z = x - b_1 a - b_2 b - b_3 c

The residuals are shown in table 11. These residuals were found
by simplifying the regression equation in terms of original values,

    X = M_x + b_1 (A - M_a) + b_2 (B - M_b) + b_3 (C - M_c)

to read

    X = K + b_1 A + b_2 B + b_3 C  ..........(82)

by merely collecting the constant terms to give K,

    K = M_x - b_1 M_a - b_2 M_b - b_3 M_c

and then subtracting the evaluation of this equation from associated
values of X. Since this is a comparatively simple process the tables
showing the arithmetic are omitted.

As a check on the computation of R_{x.abc} and of z, the standard
deviation of the residuals may be determined and used in the formula

    R_{x.abc} = sqrt( 1 - sigma_z^2 / sigma_x^2 )

The arithmetic is shown at the bottom of table 11.
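This check may be sketched in Python as follows (an illustrative addition, not part of the original manuscript). The residuals are the z_1 column as read from Table 11 and X is the dependent series as read from Table 9. Since the residuals sum to zero, their sum of squares divided by n is the squared standard deviation.

```python
# Dependent series as read from Table 9 and residuals z_1 as read from Table 11.
X = [14, 24, 4, 8, 16, 12, 13, 18, 9, 11, 11, 28, 17, 14, 15, 26, 16, 21, 19, 27]
z1 = [-0.1, 1.7, -2.7, 1.6, -2.8, -0.4, -1.6, 0.5, -2.4, -0.3,
      1.9, 1.1, 0.1, 1.8, -0.5, -1.2, 0.5, 4.7, -0.8, -1.1]

n = len(X)
mean_x = sum(X) / n
s_xx = sum((v - mean_x) ** 2 for v in X)   # n times sigma_x squared
s_zz = sum(v * v for v in z1)              # n times sigma_z squared

# R = sqrt(1 - sigma_z^2 / sigma_x^2); the factor n cancels in the ratio.
R = (1 - s_zz / s_xx) ** 0.5
print(round(s_zz, 2), round(s_xx, 2), round(R, 2))
```

Agreement with the coefficient found from the normal equations indicates that neither the solution nor the residuals contain a gross arithmetic error.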



With the residuals once determined, the charts are laid out, with
regression lines of slope, b, and the dots plotted in the manner described.
These charts, for the example, are shown in figure 2.

Notice that although the X scale is in terms of deviation from
average, the abscissae scales are in terms of original values. This is
merely a convenience. In constructing the graphs the abscissae scales
might also have been constructed as deviations from average, but this
would involve ascertaining the deviation of any given A value before
plotting its associated z as a deviation from the regression line.
The dots are located in precisely the same spots as if this process had
been gone through with, provided only that the regression line is drawn
to pass through the point representing the intersection of the mean of the
independent and the zero value of X (the latter scale only being
constructed to show deviations from average). This point is the "mean
of the distribution."

It is advisable in this connection to point out that once the
regression lines have been graphed, these graphs may be used as a means
of computing the estimates of X. Thus, for any given value of A, to
determine the contribution of A to the estimate of X it is only
necessary to read the ordinate of the regression line at the abscissa
corresponding to the given value of A. This shows the net deviation
from average in X accompanying the given value of A. In like manner
the contributions of B and C to the estimate of X may be secured
from the appropriate graphs. Summing these three readings together
gives the aggregate deviation from average in X that may be attributed
to the independent variables. To transform the total or aggregate
deviation from average to an absolute value it is obviously only necessary
to add the average of X. In passing, it might be noted that with this
last addition an estimate of X, X', is secured. The value z is secured
by subtracting X' from X.

This represents a somewhat clumsy manner of securing estimates of
X, and residuals, when we are dealing with linear multiple correlation.
Nevertheless, if, instead of straight lines, the net regression lines
were free-hand curves, it would be practically the only way of securing
estimates of X. And, indeed, this is the method used as soon as we
depart from the representation of the net relation of an independent
variable to the dependent by other than a straight line.

This departure is made forthwith by examining the dot charts
(figure 2) and observing that in two of the cases, a curved line would
give a better approximation to the dots, showing the net relations of
X to B and C, than do the straight lines. Calling upon his judgment,
and remembering that the functional relation is to be expressed by a
smoothed curve, the operator next draws in these smoothed curves, shown by
the broken lines. The straight lines are then superseded by the curves
(broken lines) as representations of the relation subsisting between X
and B and C. For the present the relation of X to A remains
represented by the straight line. The new curves are called the "first
approximations".

It next becomes desirable to obtain some measure both of how well
these first approximations have been drawn, and of how completely they
explain the variation in X. This is to be accomplished by securing
estimates of X on the basis of these curves, rather than on the basis
of the first straight lines, and correlating the estimates with the actual
X values. The estimates may be secured as explained in the fourth
preceding paragraph. It is only necessary to read the functions of the
independents from the curves for values of A, B, and C associated with
any given X value, and sum them.

The necessary readings and sums are given in table 12. In order
to use these sums of functions as estimates of X such a constant should
be added to them as would make the average of estimates, M_x', equivalent
to the average of the actuals, M_x. This constant, of course, can be
ascertained by taking the difference between the average of the sums of
functions and the average of X; thus

    K = M_x - (1/n) S[S(F)]  ..........(83)

and the new equation for estimating X may be written

    X' = K + F_1'(A) + F_2'(B) + F_3'(C)  ..........(84)

Correlating X' with X gives the multiple correlation index, P (rho).
If the curves are better representations of the relationship than the
straight lines, the correlation index should be higher than the correlation
coefficient. The correlation index is conveniently computed from new
residuals, because, as we shall observe later, it is desirable to compute
these new residuals. The new residuals may, of course, be computed by
subtracting the estimates from the "actuals", i.e.

    z_2 = X - X'

The computation of these residuals is shown in table 12. The multiple
correlation index is then found by the formula

    P^2_{x.ABC} = 1 - S(z_2^2) / S(x^2)  ..........(85)

which is recognizable as precisely analogous to the familiar formula (56).
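Formulas (83) to (85) may be followed through in a short Python sketch (an illustrative addition, not part of the original manuscript). The sums of functions are the S(F) column as read from Table 12 and X is the dependent series as read from Table 9. The constant is carried here unrounded, so the estimates differ from the tabled ones in the second decimal place.

```python
# Dependent series (Table 9) and sums of curve readings S(F) (Table 12), as read.
X = [14, 24, 4, 8, 16, 12, 13, 18, 9, 11, 11, 28, 17, 14, 15, 26, 16, 21, 19, 27]
SF = [-2.0, 8.0, -12.0, -10.5, 2.0, -4.0, -1.5, 2.0, -7.0, -5.0,
      -7.0, 12.5, 1.0, -3.0, 0.0, 12.5, 0.0, 2.5, 4.5, 13.0]

n = len(X)
mean_x = sum(X) / n

# Formula (83): the constant that equalizes the averages of estimates and actuals.
K = mean_x - sum(SF) / n

# Formula (84): estimates from the curves; then the new residuals z_2 = X - X'.
estimates = [K + f for f in SF]
z2 = [actual - est for actual, est in zip(X, estimates)]

# Formula (85): squared correlation index from the new residuals.
s_xx = sum((v - mean_x) ** 2 for v in X)
index_squared = 1 - sum(v * v for v in z2) / s_xx
index = index_squared ** 0.5
print(round(K, 2), round(index, 3))
```

An index above the linear coefficient of .96 indicates that the curves describe the data more closely than the straight lines did.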


In considering the function curves it comes to mind that if there
is a high degree of correlation between any two of the independent variables,
the process of drawing smoothed curves to pass through as many of the dots
as possible may be overdone, for if there is a grouping of positive residuals
for one independent causing the operator to draw a curve so as to pass
through that group, there is apt to be the same grouping of positive
residuals for the closely related second independent, causing the operator
to pass a curve through that group again. Thus, the same deviation may
be doubly accounted for by being attributed to the two independents. In
order to test for this case, second approximations may be secured in a
manner identical to the securing of the first approximations. Thus, new
dot charts are to be constructed in which the ordinates are the values of
X corrected for the influence of all independents (as defined by the
constructed curves) except the given one, and the abscissae the associated
values of the given independent. If the first approximation curve is
entirely satisfactory the dots will group themselves closely around the
superimposed first approximation curve showing the relation of the dependent
to the given independent.

Just as there was simplification in the process of making the first
dot charts by plotting residuals as ordinate deviations from the regression
lines for associated values of the given independent, it is likewise desirable
to construct the charts in this case by plotting the new, or "second"
residuals, z_2, as ordinate deviations from the first approximation regression
curves. To accomplish this it is only necessary to reconstruct the first
approximation curves on new graphs and proceed to plot the residuals as given
originally in Table 12, and secondarily in Table 11, since they are in
the latter table conjoined with the values of the independents necessary
to have in plotting the dots. The new dot charts described are shown in
Figure 3.

These show that further improvement in the curves representing the
relationship of X to the various independents may be had by modifying
the expressed relationships somewhat, as illustrated by the broken curves.
The new curves (broken lines) may then be taken as the second approximations.

If it is felt that there still remains room for improvement, third,
fourth, fifth and more approximations may be made. In actual practice
it is not unusual for eight or ten approximations to be made before the
investigator is satisfied that his final curves represent the best possible
expression of relationships between the dependent and the independents.
As long as it is possible by further approximations to raise the value
of rho (P), continued approximations are justified. When rho can no
longer be increased by changing the curves it indicates the futility of
further approximations.

The method of multiple curvilinear correlation may be summarized
as follows: By the usual methods of multiple correlation the net
regression lines showing the net effect of each factor upon the dependent
are plotted. The values of the dependent as obtained from the regression
equation are determined, as are the residuals. These residuals are in
turn plotted against each independent factor as deviations from each of
the regression lines, the lines then being curved to pass through the plotted
points in so far as consistent with the hypothesis of a "smooth curve"
function. From these curves the dependents are again estimated by reading
from the curve the dependent values associated with each independent.
New residuals are obtained and plotted as deviations from the regression
curves, and the process is continued until the residuals can not be reduced
further.
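
The steps summarized above can be sketched in a few lines of Python. This is
only an illustrative sketch under stated assumptions, not the graphic procedure
itself: the freehand "smooth curve" is replaced by the average residual at each
distinct value of an independent (workable only when each value recurs), each
curve is revised in turn rather than all at once, and the names `solve`,
`net_regressions` and `successive_approximations` are hypothetical.

```python
from math import sqrt
from statistics import mean


def solve(m, v):
    """Solve the normal equations m * b = v by Gauss-Jordan elimination."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(aug[r][i]))
        aug[i], aug[pivot] = aug[pivot], aug[i]
        for r in range(n):
            if r != i:
                factor = aug[r][i] / aug[i][i]
                aug[r] = [p - factor * q for p, q in zip(aug[r], aug[i])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def net_regressions(x, cols):
    """Net regression coefficients of x on each independent in cols."""
    mx = mean(x)
    dx = [xi - mx for xi in x]
    devs = []
    for col in cols:
        mc = mean(col)
        devs.append([c - mc for c in col])
    m = [[sum(p * q for p, q in zip(u, w)) for w in devs] for u in devs]
    v = [sum(p * q for p, q in zip(u, dx)) for u in devs]
    return solve(m, v)


def successive_approximations(x, cols, max_rounds=10, tol=1e-9):
    """Return the constant, the curves (as tables) and the residual history."""
    k = mean(x)
    b = net_regressions(x, cols)
    # First approximation: the net regression lines, stored as tables of
    # ordinates (deviations from the constant) for each observed abscissa.
    curves = [{val: bj * (val - mean(col)) for val in set(col)}
              for bj, col in zip(b, cols)]

    def residuals():
        return [xi - k - sum(c[col[i]] for c, col in zip(curves, cols))
                for i, xi in enumerate(x)]

    def rms(z):
        return sqrt(mean(zi * zi for zi in z))

    history = [rms(residuals())]
    for _ in range(max_rounds):
        for curve, col in zip(curves, cols):
            z = residuals()
            # "Bend" the curve: shift each ordinate by the average residual
            # plotted at that abscissa (stand-in for freehand smoothing).
            for val in curve:
                curve[val] += mean(zi for zi, ci in zip(z, col) if ci == val)
        history.append(rms(residuals()))
        if history[-2] - history[-1] < tol:  # residuals no longer reduced
            break
    return k, curves, history
```

The `history` list records the root-mean-square residual after each round, so
the point at which further approximations become futile can be read from it.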

It is seen that this method is one of approximation, and as such is
not as susceptible to the mathematical demonstration of validity and
probability as are many other statistical methods. On the other hand, as
measured by the closeness with which the dependent values may be estimated
from the independent factors and by empirical tests, the method is
considerably superior to ordinary linear multiple correlation.

In studying the method of multiple curvilinear correlation, the
investigator might observe that the first determination of the linear
regression lines is not required of necessity. The process of
approximation might be commenced with the dot charts showing the gross
relation of the dependent to the various independents. But if this were
done the investigator would have to contend with the errors introduced
by intercorrelation amongst the independents, cited in a previous paragraph,
to an even greater degree than in the method as presented. For if the
weighted average slopes of the curves are first determined by methods of
linear multiple correlation, the effect of intercorrelation amongst the
independents is eliminated except in so far as deviations from linearity are
correlated. It is, therefore, always advisable to let the initial
approximation to the functional relations be the net regression lines.

The apportionment of variability in the dependent to the various
independent factors may be accomplished in much the same manner as that
described for multiple linear correlation. This is done by taking the
functions of independents read from the final curves as the independent
variables, rather than the actual variables themselves, and correlating
with the dependent. Then if product moments and standard deviations be
computed, and net regression coefficients of X on the functions of
independents, by the usual methods of multiple correlation, all the figures
necessary to the computation of coefficients of determination and part
correlation will be available. Coefficients of partial correlation, however,
would be very difficult to interpret if computed from these series.

It should be noted that the net regressions of the dependent on
the functions of the independents should all be 1.0 since the function
curves were so constructed that by adding together readings from them
(with weights of 1.0) the dependent could be estimated. If a given net
regression coefficient should prove to have a value other than 1.0, the
given function curve should be modified or "tilted" so that the value of
b would come out 1.0. Thus if the value of b should prove to be .8,
the curve would have to be changed so that, for any given abscissa, the
ordinate value would be .8 of what it was formerly. This, then,
would cause the net regression, b, to be 1.0.
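
The "tilting" just described can be checked numerically. The sketch below is
a simplified illustration under stated assumptions: only one function curve is
present, so the net regression reduces to a simple regression; the curve is
stored as a table of ordinates; and the names `regression` and `tilt` and the
figures used with them are hypothetical.

```python
from statistics import mean


def regression(x, f):
    """Regression coefficient of x on the readings f taken from a curve."""
    mx, mf = mean(x), mean(f)
    num = sum((fi - mf) * (xi - mx) for fi, xi in zip(f, x))
    den = sum((fi - mf) ** 2 for fi in f)
    return num / den


def tilt(curve, b):
    """Multiply every ordinate of the curve by b, leaving abscissae as is."""
    return {abscissa: b * ordinate for abscissa, ordinate in curve.items()}
```

If the readings from a curve give a regression of b, the readings from
`tilt(curve, b)` give a regression of 1.0, since multiplying an independent
by b divides its regression coefficient by b.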

The curvilinear correlation methods which have been described have
enabled us to determine the functional relations in a formula of the type

$$X = K + F_1(A) + F_2(B) + \dots + F_n(n)$$

Methods have also been designed 14/ to enable us to determine the
functions in relations of the following type

$$F_0(X) = K + F_1(A) + F_2(B) + \dots + F_n(n) \tag{86}$$

This, of course, is equivalent to

$$X = F_0\left[K + F_1(A) + F_2(B) + \dots + F_n(n)\right] \tag{87}$$

and it is in this latter form that $F_0$ is determined.




14/ Bruce, Donald. "On Possible Modifications in the Ezekiel Method for
Handling Curvilinear Multiple Correlation." MS. filed in Library of the U. S.
Dept. of Agriculture.

Formula (86) enables us to reduce somewhat the rigidity of the
implicit assumption inherent in formula (79), and thus represents an even
further step towards analytical methods free from such assumptions. Thus
where (79) propounds that X is a sum of functions of the independents,
(86) propounds that some function of X is a sum of functions of independents.

The method of determining $F_0$ in formula (87) is quite simple.
In addition to graphing the residuals against each of the independents as
ordinate deviations from the net regression lines, the residuals are also
graphed against the sum of the functions read from the curves plus the
constant term, i.e. against X', as an ordinate deviation from a forty-five
degree line on a coordinate chart in which the abscissa and ordinate scales
are identical. The X' values are taken as abscissae. The ordinates
then represent $F_0(X')$, and, of course, for the purpose of first
approximating $F_0$ the relation between X' and X, or X' and $F_0(X')$, is
taken as a "one-to-one" relation; hence the forty-five degree slope. But in
later approximations, just as the residuals are plotted as deviations from
curves determined by preceding processes, so also they are plotted from
the curve determined in the preceding process for $F_0$. New residuals
are of course found by subtracting $F_0(X')$ from X, for in reality $F_0(X')$
is the true estimate of X, rather than X' alone.



VII.    Joint Relationships 15/

We have seen how by methods of correlation certain specific cases
of the general theorem, that the dependent variable is some function of
the independents, may be tested for. These methods have taken care of
linear and non-linear additive functions. It is now proposed to show how
with a limited number of variables and a large number of observations it is
possible to completely eliminate the limitations of the above cited methods.
In short it is proposed to define the function in the following equation,
based solely upon the data themselves and with no suppositions as to
additive or other types of relationships and provided only that whatever
the function be it changes systematically with changing values of the
independents, i.e. it is a "smoothed" function.

$$X = F(A, B) \tag{88}$$

As  in  the  case  of  curvilinear  multiple  correlation,  the  methods 
employed  are  approximation  methods  and  are  thus  not  subject  to  the  same 
rigid mathematical demonstration of validity that other methods are.
Nevertheless it is possible to adapt certain of the measures of relationship
to this case and thus obtain measures of agreement. The nature of the
relationships  discovered  can  best  be  represented  graphically. 

The appropriateness of methods for determining F in (88) as
opposed to the other types of relationships discussed may first be considered.
Suppose that X represented the yield per acre of a crop, A the rainfall
during the growing season, and B the temperature. Then a given deviation
in A from its normal would have some effect upon X. But this effect
on X would be different according to whether the temperature were
high or low. Thus with warm weather a given increase in rainfall might
easily have a more pronounced effect on yield than the same increase in
rainfall might have with cold weather. It thus becomes apparent that
the discovery of the relationships as represented by F in the following,

$$F(X) = F(A) + F(B)$$

is inadequate, for F(A) changes with values of B. We have here not
an additive relationship but a joint relationship. This type of
relationship occurs in many cases.

15/ For original treatment of joint relationships in multiple correlations
see Ezekiel, Mordecai. Determination of Correlation "Surfaces" in the
Presence of Other Variables. MS. submitted to Amer. Statis. Assoc.
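
The inadequacy of an additive form in the presence of a joint effect can be
illustrated with a small table. The following is a sketch only, assuming a
complete table with one value of X for every pairing of A and B; the function
name `additive_residuals` and all figures are hypothetical. The closest
additive fit to such a table is the row average plus the column average less
the grand average, and whatever it leaves over is the joint effect.

```python
from statistics import mean


def additive_residuals(table):
    """Residuals of table[i][j] from the best additive fit.

    table[i][j] holds X for the i-th value of A and the j-th value of B.
    """
    grand = mean(v for row in table for v in row)
    row_means = [mean(row) for row in table]
    col_means = [mean(col) for col in zip(*table)]
    return [[table[i][j] - (row_means[i] + col_means[j] - grand)
             for j in range(len(col_means))]
            for i in range(len(row_means))]


# Hypothetical yields: in the first table the effect of rainfall (rows) is
# the same at every temperature (columns); in the second it grows with it.
additive = [[a + 2.0 * b for b in (1, 2, 3)] for a in (1, 2, 3)]
joint = [[float(a * b) for b in (1, 2, 3)] for a in (1, 2, 3)]
```

For the first table every residual is zero; for the second the residuals are
not zero, however the separate curves for A and B are drawn.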

The method of determining F in (88) is to construct a three
dimensional diagram, in which the depths are the values of A, the widths
the values of B and the heights the values of X. A smoothed surface
is then constructed to be as representative of the plotted points as
consistent with the hypothesis of a smoothed surface. To do this smoothing
graphically requires considerable skill and not a little patience. To
assist the process in three variable problems a machine has been constructed.
This machine is a board through which holes are bored in a checkerboard
pattern. Through these numerous holes are passed rods which may be slipped
through them to any given length. One dimension along the board is taken
to represent values of one independent, the other dimension the other
independent. The rods are pushed through the holes so that the length
they protrude is proportional to the average value of the dependent for
the associated values of the independents, as indicated by the location
of the particular hole on the board. A diagram which will help to visualize
the above described machine may be seen on page 246 of G. U. Yule's
Introduction to the Theory of Statistics (Sixth Edition, 1922).



If in the process of smoothing it is decided that a flat surface
(a surface with slopes in two directions, such as might be illustrated by
tipping a piece of board in two directions) is the most appropriate,
these two slopes are nothing but the net regressions of X on the two
independents, for manifestly the position of such a "flat" slope is the
plane completely defined by two straight lines intersecting at right
angles. This is the plane discussed by Yule in connection with the
above cited diagram.

But,  if  curved  slopes  are  to  be  introduced  of  a  nature  which  might 
be illustrated by a section of a trough or flume, or a section of pipe
divided  longitudinally,  and  tilted  so  as  to  have  specified  average  slopes 
with  reference  to  a  base,  then  the  curvilinear  correlation  methods  are 
appropriate. 

But if, finally, the slopes are to be illustrated as in the
preceding paragraph except that the trough is twisted, as might be
accomplished by holding one end steady while the other end were twisted
several degrees around the longitudinal axis, then it is essential that the
joint function be determined. It is impossible to define the slopes in the
two directions independently, for these slopes change as we pass from
one edge of the plane to another.

In the machine for assisting in the determination of these slopes,
the ends of adjacent protruding rods are connected by threads, which then,
to a degree, represent the unsmoothed surface indicative of the relation
of the heights to the two base-board measurements. A different colored
thread is then connected to the shanks of the rods in such a manner as
to represent a more generalized or smoothed surface. The heights of
this surface above the base-board are then taken as the effect on X of
the values of the two independents indicated by the location of the rod
on the base-board. The surface may be recorded by constructing a table
somewhat similar to a double frequency table. The captions may be the
values of A, the stubs the values of B; the body of the table then
contains the heights of the surface for the related values of A and B.

The measure of correlation may be obtained by securing the residuals
(differences between the heights of the surface and the actual observations
of X for associated values of A and B) and using the standard
deviations of residuals and of X to obtain the correlation index by formulae
described previously.
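
The recording of the surface as a table, and the measure of correlation
drawn from it, may be sketched as follows. This is an illustration under
stated assumptions, not the machine procedure: the surface is left unsmoothed
(each height is simply the average of X for a pairing of A and B, as with the
rods before the second thread is strung), the correlation index is taken as
the square root of one minus the ratio of the squared standard deviation of
the residuals to that of X, and the function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, pvariance


def surface_table(a, b, x):
    """Average X for each observed pairing of A and B (unsmoothed surface)."""
    cells = defaultdict(list)
    for ai, bi, xi in zip(a, b, x):
        cells[(ai, bi)].append(xi)
    return {pair: mean(values) for pair, values in cells.items()}


def correlation_index(a, b, x, surface):
    """Index of correlation of X with the heights read from the surface."""
    z = [xi - surface[(ai, bi)] for ai, bi, xi in zip(a, b, x)]
    var_z = mean(zi * zi for zi in z)   # squared standard deviation of residuals
    return sqrt(1 - var_z / pvariance(x))
```

A smoothed surface would leave somewhat larger residuals than the cell
averages used here, and hence a somewhat lower index.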

If there are three instead of two independents in formula (88)
the determination of the joint relationship is further complicated. It
becomes necessary first to determine the surface showing the relation
of A and B to X for a given value of C. Another surface is then
determined showing the relation of A and B to X for the next value
of C; yet another surface is then determined for the next value of C,
and so on. The resulting surfaces may be visualized by imagining the
roofs on a row of houses, each roof progressively differing from the
preceding with changing values of C. In short, we have here a three
dimensional figure taken at points as it has moved through space, with
one of the dimensions, the heights, systematically changing as it moved.
The lines traced by given points on the surface (defined by their vertical
position above given points on the base of the figure), as the figure
moves through space, should represent smoothed curves.

In like manner a representation of four independent variables to
the dependent may be had by imagining that the whole row of houses be moved
progressively sideways with changing values of the fourth variable, until
we have a solid block of roofs. The whole block can then be moved in two
directions to represent changing values of two more independents, and so
on. These representations of joint relationships are of more use as
concepts than they are as methods. It is generally impractical and often
impossible to work out such joint relationships involving more than two
independent variables.

On the other hand, Mr. Ezekiel has developed a method whereby it
is possible to discover such joint relationships of two variables to the
dependent, in the presence of other independents. In short, by his method
it is possible to determine F in the following:

$$X = F_0(A, B) + F_3(C) + \dots + F_n(n) \tag{89}$$

By this method F' values are first secured in the following by curvilinear
methods:

$$X = F_1'(A) + F_2'(B) + \dots + F_n'(n)$$

When the values of F' have been secured, a surface is constructed, either
graphically or by means of the model or machine described previously, in
which the slope with reference to the A dimension is made equivalent to
$F_1'$, and with reference to the B dimension, $F_2'$. The surface is, of
course, a curved surface, but cross sections of the surface taken through any
values of either independent are similar in form irrespective of the values
of the other independent.

The residuals are next plotted (or represented) as altitude
deviations from the surface at points of the surface located vertically above
the intersection of associated A and B values on the base. The surface
is then warped so as to give the best possible representation of the
location of residuals.

It will be seen that by this method the average slopes of the surface
are first determined by means of curvilinear methods, which then serve to
show the effect of A and B on X, in so far as these effects are
independent of their joint relationship. The warping of the surface then
provides for any additional effects due to the joint effect of the two.

Following the determination of this surface it is advisable to
recompute residuals and again plot these residuals as deviations from the
several ascribed functions to ascertain whether or not further slight
modifications should be made in the functions. This approximation process
parallels that described for curvilinear correlation, except that in the
case of variables A and B a surface rather than a curve is to be
smoothed. The final resulting curves and surface are then considered to be
F in (89), $F_0$ being a surface and $F_3 \dots F_n$ being curves. Needless
to say, several pairings of variables can be carried on simultaneously in the
same analysis, if desired. The factors C and D, for example, may be
paired in a manner similar to A and B.

The measure of correlation, as in previous cases, is secured by inserting
the appropriate standard deviations in

$$\rho^2_{X.(AB)CD \dots N} = 1 - \frac{\sigma_z^2}{\sigma_x^2}$$

Coefficients of part correlation and determination are computed by
correlating the readings from the curves and surface with X as described for
curvilinear correlation, A and B, of course, being considered as one since
their combined effect is included by using the readings from the surface
as one of the independents.



VIII.    Application to Time Series

A time series is a series of measurements which have been made
on a given phenomenon at different (usually equidistant) points in time.
Many series used in economic research, such as average monthly prices,
are time series. Many time series have certain peculiar characteristics
which require special technique for their statistical description, and
enjoin caution in the application of analytical methods in the delineation
of relationships.

In general, many time series are characterized by a "trend"
movement. Thus there is a gradually increasing production of most
commodities, paralleling increasing population, and diminishing costs of
production. Some series show a downward trend, such as the manufacture of
horse-drawn pleasure vehicles, timber resources, percentage of child
mortality, etc. Most such trend movements are basically attributable to a
gradually evolving civilization or environment and its multitudinous
manifestations. In studying, then, the relation of the price of a commodity
to various factors over a period of time, it is well nigh requisite that some
account be taken of the influence of these "multitudinous manifestations."
But it is obviously impossible to secure statistical measurements of all
these influences, and thus obviously impossible to include direct
measurements of them in the analysis of the factors influencing the price of
the commodity over the period of time. If, however, we make a certain
assumption this difficulty may in an indirect way be surmounted. This
basic assumption is that, since we are in an environment gradually evolving
as we pass through time, then the influence of this environment may be
considered statistically as some function of a numerical description
of the passage of that time. Of course, to the degree that we are able
to obtain direct measurements on significant, related factors, such as
supply, consumption, costs of production, etc., to that extent the number
of influences in the environment which must be thrown together and
empirically measured statistically by the passage of time is diminished.
In short the basic assumption is that the aggregate influence of such
otherwise unmeasured factors as develop gradually or recurringly may be taken
as some function of a numerical description of the passage of time.

If we have no measures of related factors, but only the price
series under consideration, then the relation of the price to all related
factors, as measured under the blanket assumption that their effect is
some function of time, may be determined by finding the regression of
price on time as described for gross correlation. This regression line,
in time series analysis, is customarily styled the (linear) "trend" of
the series. Since the numerical measurement of time is purely arbitrary
it is convenient to let its value for the middle observation be zero,
numbering succeeding observations progressively 1, 2, 3, etc., and
preceding observations progressively -1, -2, -3, etc. This simplifies
the computation of the regression since the average, $M_t$, of the time
measurements, T, will be zero. The product moment, $p_{xt}$, is then


$$p_{xt} = \frac{1}{n} S(XT) \tag{91}$$

The standard deviation squared of T is

$$\sigma_t^2 = \frac{1}{n} S(T^2) \tag{92}$$

And the regression, of course,

$$b_{xt} = \frac{p_{xt}}{\sigma_t^2} \tag{93}$$
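
The arithmetic of formulas (91) to (93) is brief enough to set out as a
sketch in Python. The names `time_codes` and `linear_trend` are hypothetical.
Time is numbered so that S(T) is zero: from the middle observation outward
when the number of observations is odd and, as an assumption covering the
even case, by odd integers on either side of the middle.

```python
from statistics import mean


def time_codes(n):
    """Values of T for n equidistant observations, chosen so S(T) = 0."""
    if n % 2:                       # odd: ..., -2, -1, 0, 1, 2, ...
        middle = n // 2
        return [i - middle for i in range(n)]
    return [2 * i - (n - 1) for i in range(n)]   # even: ..., -3, -1, 1, 3, ...


def linear_trend(x):
    """Return the average of X, the regression of X on T, and the T values."""
    n = len(x)
    t = time_codes(n)
    p_xt = sum(xi * ti for xi, ti in zip(x, t)) / n    # formula (91)
    var_t = sum(ti * ti for ti in t) / n               # formula (92)
    b = p_xt / var_t                                   # formula (93)
    return mean(x), b, t
```

With an even number of observations each unit of T is half the interval
between observations, so the regression is then the increment per half
interval.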


In time series analyses it is customary to call the regression of X on T
the annual or monthly "increment" in the trend according to whether the
series is an annual or monthly series.

Tables have been constructed (see Mills and Davenport, "Problems
and Tables in Statistics") so that the value of S(T²) may be read from
the maximum value of T in the series.

If T has values as suggested, S(T) will be zero only when
there are an odd number of observations. When there are an even number
of observations, S(T) may be made equal to zero by assigning values
1, 3, 5, 7, 9, etc., going forward through the last half of the observations
and -1, -3, -5, -7, -9, etc., going backward through the first half of
the observations. Tables showing the value of S(T²) when so numbered
have also been prepared.

This computation of the relation of the dependent (price) series
to all other factors has been made upon the basis that we have no
measurement of those factors other than the passage of time. If, now,
we have measurements of the supply, we have two sets of measurements
on independent factors, the supply series, A, and the otherwise unmeasured
factors, T. And it is to be particularly noted that T in this case
represents a different group of factors than in the earlier case where
we had no measurement of A, since A has been removed from the "otherwise
unmeasured" group. The effect of T, therefore, is not necessarily
the same. We now have a multiple correlation problem in which X is
the dependent and A and T the independents. Usual multiple correlation
technique is applicable. The values of b are determined in
the following,

$$X = \underline{b}_1 A + \underline{b}_2 T + K \tag{94}$$

and interpreted with strict reference to not only the implicit assumptions
of the equation, but also to the assumptions involved in using T as a
statistical measurement.

Not only are many time series characterized by what may be termed
"trend" movements, but many are also characterized by what may be termed
"seasonal" movement. Just as trend movements arise from gradually
changing environment, seasonal movements arise from a changing and
recurring environment. Thus, crops are marketed only in certain times
during the year; building is more active in certain months than in others;
retail sales have pronounced increases in holiday seasons; prices often
reflect changes of a systematic nature as we pass through the year.
Just as in the case of trend, it is equally impossible to measure all
the factors which contribute to seasonal movements; and hence, in analogous
fashion, the influence of such factors is arbitrarily said to be some
function of the proportion of the year which has passed, that is, some
function of a series with values from one to twelve, recurring each year.
And, just as it was found that in the analysis of the relation of a given
dependent series to an independent time series it was desirable to include
a trend measurement, so also is it desirable to include a seasonal
measurement.

Since, however, the curve representing the influence of the season
upon the dependent must come back to its starting level, linear correlation
methods are inadequate. Curvilinear correlation methods should be
employed. Thus to (94) should be added a term in S, seasonal, so
that it reads,

$$X = K + \underline{b}_1 A + \underline{b}_2 T + \dots + S \tag{95}$$

This method of handling trend and seasonal in the analyses of
time series 16/ may be termed a simultaneous determination of trend and
seasonal. In passing it may be pointed out that this method differs
from procedure followed by many investigators, with whom it is customary
to "extract trend and seasonal" from all series and then correlate
residuals. In reconciliation of the two methods it may be pointed out,
however, that absolute errors of estimating from regression equations
will in both cases be identical.
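
The identity of the errors of estimate under the two procedures can be
verified on a small example. The sketch below is illustrative only: it treats
a linear trend and a single independent A, leaves the seasonal aside, and
uses hypothetical function names. In the first procedure the trend is
extracted from both X and A and the residuals are then related; in the second
X is related to A and T simultaneously.

```python
from statistics import mean


def centered(v):
    """Deviations of a series from its average."""
    m = mean(v)
    return [vi - m for vi in v]


def simple_residuals(y, t):
    """Residuals of y from its straight-line regression on t."""
    y, t = centered(y), centered(t)
    b = sum(yi * ti for yi, ti in zip(y, t)) / sum(ti * ti for ti in t)
    return [yi - b * ti for yi, ti in zip(y, t)]


def extracted_residuals(x, a, t):
    """Extract trend from X and from A, then relate the two sets of residuals."""
    return simple_residuals(simple_residuals(x, t), simple_residuals(a, t))


def simultaneous_residuals(x, a, t):
    """Residuals of X from its multiple regression on A and T together."""
    x, a, t = centered(x), centered(a), centered(t)
    saa = sum(ai * ai for ai in a)
    stt = sum(ti * ti for ti in t)
    sat = sum(ai * ti for ai, ti in zip(a, t))
    sax = sum(ai * xi for ai, xi in zip(a, x))
    stx = sum(ti * xi for ti, xi in zip(t, x))
    det = saa * stt - sat * sat
    b1 = (sax * stt - sat * stx) / det   # net regression of X on A
    b2 = (saa * stx - sat * sax) / det   # net regression of X on T
    return [xi - b1 * ai - b2 * ti for xi, ai, ti in zip(x, a, t)]
```

The residuals agree observation by observation. The coefficients of
correlation need not agree, since the variability of the dependent against
which the residuals are compared differs between the two procedures.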

There is another problem which is more acute in the analysis of
time series than in many other analyses: that of a changing relationship
as we pass through time. Thus the relation of supply to price of a
commodity may have changed considerably during the past decade because
of the introduction of substitutes, shifts in styles, major changes in
costs, etc. Again, there may have been a pronounced change in the normal
seasonal curve, owing to the introduction of storage facilities or other
numerous factors. This type of change can be adequately delineated by
application of the methods cited in the discussion of joint relationships,
for it is essentially a problem of the joint effect of T and the given
variable. Thus, all that is required is to obtain the surfaces represented
by F in the following:

$$X = K + F_1(A, T) + F_2(S, T) + \dots + F_n(n, T) \tag{96}$$

To accomplish this first solve for F' in

$$X = K + F_1'(A) + F_2'(S) + F_3'(T) + \dots + F_n'(n) \tag{97}$$

by pre-described methods, and then by conjoining the values of T with
those of each of the other variables, as described in connection with joint
relationships, determine the surfaces, F, in (96).

16/ See also Smith, B. B. The Error in Eliminating Secular Trend and
Seasonal Variation before Correlating Time Series. Amer. Statis. Assoc. Jour.
20: 543-545. 1925.

In the analysis of historical price series, it is sometimes desirable
to use "undeflated" prices as the dependent variable, and find the best
adjustment for price level by using the index of price level as one of
the independent variables. After this relation has been determined it may
be desired to find the correlation of the price adjusted for price level
according to the relation found. That is, instead of finding the multiple
correlation of the independent factors (including price level) with price,
the object is to take Price / f(price level) as the dependent variable. 17/
In this and in similar cases it may be desirable to correct the dependent for
the influence of but one independent as shown by the net regression equation,
and then ascertain the multiple correlation between the dependent so treated
and the remaining independents.

Thus, having found b in

$$x = \underline{b}_1 a + \underline{b}_2 b + \underline{b}_3 c + \dots + \underline{b}_n n \tag{98}$$

it is desired to find the correlation between

$(x - \underline{b}_1 a)$ and b, c, ... n.

17/ Ezekiel, Mordecai. The Assumptions Implied in the Multiple Regression
Equation. Amer. Statis. Assoc. Jour. 20: 405-408. 1925.



In a solution for b in

$$(x - \underline{b}_1 a) = \underline{b}_2 b + \underline{b}_3 c + \dots + \underline{b}_n n \tag{99}$$

the b constants will obviously take the same values as in (98) above,
since, the $\underline{b}_1 a$ term having been subtracted from x prior to
the new correlation, these values of b are the only ones which will give
minimum squared residuals. The residuals will thus obviously be the same
whether evaluated from (99) or (98) since, in effect, in (99) all
the terms are subtracted from x, just as in (98).

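The correlation $R_{(x - b_1 a).bc \dots n}$ may then be determined as follows:

$$\sigma_z^2 = \sigma_x^2 \left(1 - R^2_{x.abc \dots n}\right)$$

and

$$\sigma^2_{(x - b_1 a)} = \frac{1}{n}\, S(x - b_1 a)^2 = \frac{1}{n}\left[ S(x^2) - 2 b_1 S(ax) + b_1^2 S(a^2) \right] = \sigma_x^2 - 2 b_1 p_{ax} + b_1^2 \sigma_a^2$$

Then by a familiar theorem

$$R^2_{(x - b_1 a).bc \dots n} = 1 - \frac{\sigma_x^2 \left(1 - R^2_{x.abc \dots n}\right)}{\sigma_x^2 - 2 b_1 p_{ax} + b_1^2 \sigma_a^2} \tag{100}$$
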


All  values  necessary  to  the  evaluation  of  the  coefficient  are  procured 
by  the  original  multiple  correlation  solution. 
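
The following is a minimal sketch, not part of the original text, that checks formula (100) numerically for the simplest case of two independents, a and b. The data and every name in the code (`deviations`, `moment`, `r2_formula`, `r2_direct`) are invented for illustration. It assumes all series are taken as deviations from their means, so that `moment(a, x)` plays the part of $p_{ax}$.

```python
# Illustrative check of formula (100) with two independents, a and b.
# The data are invented for this sketch; they do not come from the text.

def deviations(values):
    """Return each value less the mean of its series."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def moment(u, v):
    """Mean product of two series of deviations (a variance when u is v)."""
    return sum(p * q for p, q in zip(u, v)) / len(u)

x = deviations([12.0, 15.0, 11.0, 18.0, 20.0, 17.0, 14.0, 22.0])
a = deviations([1.0, 2.0, 1.5, 3.0, 3.5, 2.5, 2.0, 4.0])
b = deviations([5.0, 4.0, 6.0, 3.0, 2.0, 4.5, 5.5, 1.0])

var_x, var_a, var_b = moment(x, x), moment(a, a), moment(b, b)
p_ax, p_bx, p_ab = moment(a, x), moment(b, x), moment(a, b)

# Net regressions b1 and b2 from the two normal equations of x = b1*a + b2*b.
det = var_a * var_b - p_ab ** 2
b1 = (p_ax * var_b - p_bx * p_ab) / det
b2 = (p_bx * var_a - p_ax * p_ab) / det

# Multiple correlation (squared) of x with a and b, from the residuals z.
z = [xi - b1 * ai - b2 * bi for xi, ai, bi in zip(x, a, b)]
r2_full = 1 - moment(z, z) / var_x

# Formula (100): uses only quantities from the original solution.
var_adjusted = var_x - 2 * b1 * p_ax + b1 ** 2 * var_a
r2_formula = 1 - var_x * (1 - r2_full) / var_adjusted

# Direct route: subtract b1*a from x, then correlate the remainder with b.
y = [xi - b1 * ai for xi, ai in zip(x, a)]
b2_direct = moment(y, b) / var_b
z_direct = [yi - b2_direct * bi for yi, bi in zip(y, b)]
r2_direct = 1 - moment(z_direct, z_direct) / moment(y, y)

print("b2 from (98):", b2, " b2 from (99):", b2_direct)
print("R^2 by (100):", r2_formula, " R^2 directly:", r2_direct)
```

Under these assumptions the two values of b2 agree, and so do the two values of the squared correlation, apart from rounding. This is the claim made in the text: the second solution adds nothing that the first did not already supply.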



There is yet another problem which becomes more significant
in the correlation of time series than in other types, owing to limited
numbers of observations.  This is the problem of the reliability
of results obtained.  In general the coefficient of multiple
correlation has been taken as a measure of this reliability.  But if we
should take a case in which there were as many independent variables
as there were observations, it would obviously be possible to find
values of net regressions which would result in a perfect multiple
correlation.  The results, however, would be worthless as an
interpretation of underlying economic laws.  Thus the true, underlying
correlation is confused with the probable correlation that might
result from the pure accident of numbers and the possibility
of adapting a certain number of functions to purely random series.
A correlation coefficient may be developed which eliminates
this purely mathematical probability, and which includes the number
of independent variables, m, as related to the number of observations,
n, in its expression.
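
The point about perfect but worthless correlation can be illustrated with a small sketch, which is not part of the original text. Three observations of a dependent x are fitted with three independents whose values are arbitrary numbers with no intended relation to x. The helper `solve` and all the data are hypothetical, and the sketch assumes no constant term is fitted, so the three net regressions are the only constants.

```python
# With as many constants as observations, arbitrary series fit perfectly.
# All numbers are arbitrary and chosen only for illustration.

def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Bring the largest remaining entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][k] * solution[k] for k in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution

# Each row holds one observation of the three independents a, b, c.
independents = [
    [2.0, 9.0, 5.0],
    [7.0, 4.0, 3.0],
    [1.0, 6.0, 8.0],
]
x = [10.0, 1.0, 4.0]  # the dependent, equally arbitrary

net_regressions = solve(independents, x)
residuals = [
    xi - sum(coef * value for coef, value in zip(net_regressions, row))
    for xi, row in zip(x, independents)
]
sum_squared_residuals = sum(z * z for z in residuals)

print("net regressions:", net_regressions)
print("sum of squared residuals:", sum_squared_residuals)
```

The residuals vanish apart from rounding, so the multiple correlation computed from them would be perfect, although the series were chosen without any relation to one another.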



Thus from least square theory (see Merriman, "Method of Least
Squares"), z being a residual, and e the error of estimate,

$$e^2 = \frac{S(z^2)}{n - m} \tag{101}$$

But by inserting the equivalent of $S(z^2)$ from (28), (101) becomes

$$e^2 = \frac{n\, \sigma_x^2 \left(1 - R^2\right)}{n - m}$$

and

$$e = \sigma_x \sqrt{\frac{1 - R^2}{1 - m/n}} \tag{102}$$

or a modified error of estimate which may be written $\bar{S}$.  If $\bar{S}$ becomes
greater than $\sigma_x$ it means that the error of estimating from the
regression equation for new cases is greater than the standard deviation,
and hence worse than merely taking the average of the dependent as the
estimate.

If we substitute $\bar{S}$ for the standard deviation of residuals in
a familiar formula, then we may secure a coefficient of correlation, $\bar{R}$,
modified for the ratio of m to n:

$$\bar{R}^2 = 1 - \frac{1 - R^2}{1 - m/n} \tag{103}$$

which is the requisite formula.  The application of this formula to
time series analyses will often give the operator pause, and check
unwarranted enthusiasm.
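
As a brief illustration, not part of the original text, formulas (102) and (103) may be written as two small functions. The function names and the sample figures are assumptions made for this sketch. Here m is the number of constants fitted and n the number of observations, with n greater than m.

```python
import math

def modified_error(sigma_x, r_squared, n, m):
    """Formula (102): error of estimate modified for the ratio of m to n."""
    if m >= n:
        raise ValueError("n must exceed m for the formula to apply")
    return sigma_x * math.sqrt((1 - r_squared) / (1 - m / n))

def modified_r_squared(r_squared, n, m):
    """Formula (103): squared correlation modified for the ratio of m to n."""
    if m >= n:
        raise ValueError("n must exceed m for the formula to apply")
    return 1 - (1 - r_squared) / (1 - m / n)

# Hypothetical cases: the same unmodified R^2 of 0.90 from 20 and from 10
# observations, with 5 and 8 constants respectively.
ample = modified_r_squared(0.90, n=20, m=5)
scant = modified_r_squared(0.90, n=10, m=8)

# A case in which the modified error exceeds the standard deviation itself.
error = modified_error(sigma_x=2.0, r_squared=0.50, n=10, m=8)

print("20 observations, 5 constants:", round(ample, 4))
print("10 observations, 8 constants:", round(scant, 4))
print("modified error against a standard deviation of 2.0:", round(error, 4))
```

In the last case the modified error is greater than the standard deviation of the dependent, which is the situation the text describes as worse than merely taking the average as the estimate. The modified squared correlation is then negative.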




